2025-05-07T20:23:19.6967074Z Current runner version: '2.323.0'
2025-05-07T20:23:19.6973897Z Runner name: 'i-04dd41b83603cbddd'
2025-05-07T20:23:19.6974849Z Machine name: 'ip-10-0-8-106'
2025-05-07T20:23:19.6977590Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:19.6979882Z Contents: read
2025-05-07T20:23:19.6980415Z Metadata: read
2025-05-07T20:23:19.6980915Z Packages: read
2025-05-07T20:23:19.6981414Z ##[endgroup]
2025-05-07T20:23:19.6983299Z Secret source: None
2025-05-07T20:23:19.6983942Z Prepare workflow directory
2025-05-07T20:23:19.7898929Z Prepare all required actions
2025-05-07T20:23:19.7938127Z Getting action download info
2025-05-07T20:23:20.0004139Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:20.3009944Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:20.7309895Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:22.4442951Z Getting action download info
2025-05-07T20:23:22.5667867Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:22.8530935Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.8.0, 12.6.3, gcc)
2025-05-07T20:23:22.9136851Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:22.9272823Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:22.9285758Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:22.9287306Z ##[endgroup]
2025-05-07T20:23:24.7938367Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:24.7938813Z Instance Type: g5.4xlarge
2025-05-07T20:23:24.7939066Z AMI Name: unknown
2025-05-07T20:23:24.7974161Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:30.1909220Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:30.1909540Z with:
2025-05-07T20:23:30.1909766Z   submodules: true
2025-05-07T20:23:30.1910012Z   repository: pytorch/FBGEMM
2025-05-07T20:23:30.1910408Z   token: ***
2025-05-07T20:23:30.1910617Z   ssh-strict: true
2025-05-07T20:23:30.1910836Z   ssh-user: git
2025-05-07T20:23:30.1911062Z   persist-credentials: true
2025-05-07T20:23:30.1911320Z   clean: true
2025-05-07T20:23:30.1911552Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:30.1911828Z   fetch-depth: 1
2025-05-07T20:23:30.1912051Z   fetch-tags: false
2025-05-07T20:23:30.1912273Z   show-progress: true
2025-05-07T20:23:30.1912501Z   lfs: false
2025-05-07T20:23:30.1912713Z   set-safe-directory: true
2025-05-07T20:23:30.1912978Z env:
2025-05-07T20:23:30.1913196Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:30.1913843Z   BUILD_ENV: build_binary
2025-05-07T20:23:30.1914113Z   BUILD_TARGET: genai
2025-05-07T20:23:30.1914350Z   BUILD_VARIANT: cuda
2025-05-07T20:23:30.1914618Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:30.1914877Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:30.1915119Z ##[endgroup]
2025-05-07T20:23:30.3088420Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:30.3090630Z ##[group]Getting Git version info
2025-05-07T20:23:30.3091663Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:30.3092840Z [command]/usr/bin/git version
2025-05-07T20:23:30.3093401Z git version 2.47.1
2025-05-07T20:23:30.3109848Z ##[endgroup]
2025-05-07T20:23:30.3123090Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/98d045e0-d391-4420-aa0d-7228e750a89f/.gitconfig'
2025-05-07T20:23:30.3145694Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/98d045e0-d391-4420-aa0d-7228e750a89f' before making global git config changes
2025-05-07T20:23:30.3147433Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:30.3151653Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:30.3196826Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:30.3221419Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:30.3239323Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:30.3243108Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:30.3269162Z refs/heads/main
2025-05-07T20:23:30.3279076Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:31.1952424Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:31.1999863Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:31.2027445Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:31.2033757Z ##[endgroup]
2025-05-07T20:23:31.2036800Z [command]/usr/bin/git submodule status
2025-05-07T20:23:31.2453382Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:31.2536868Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:31.2622549Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:31.2711969Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:31.2800068Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:31.2885847Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:31.2967955Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:31.2982066Z ##[group]Cleaning the repository
2025-05-07T20:23:31.2987204Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:31.3045917Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:31.3157410Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:31.3164720Z ##[endgroup]
2025-05-07T20:23:31.3166728Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:31.3171498Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:31.3203132Z ##[endgroup]
2025-05-07T20:23:31.3203539Z ##[group]Setting up auth
2025-05-07T20:23:31.3208731Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:31.3251078Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:31.3580245Z Entering 'external/asmjit'
2025-05-07T20:23:31.3646920Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.3721280Z Entering 'external/cpuinfo'
2025-05-07T20:23:31.3787622Z Entering 'external/cutlass'
2025-05-07T20:23:31.3862140Z Entering 'external/googletest'
2025-05-07T20:23:31.3926470Z Entering 'external/hipify_torch'
2025-05-07T20:23:31.3992884Z Entering 'external/json'
2025-05-07T20:23:31.4078076Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:31.4110256Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:31.4439254Z Entering 'external/asmjit'
2025-05-07T20:23:31.4503494Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.4577075Z Entering 'external/cpuinfo'
2025-05-07T20:23:31.4644947Z Entering 'external/cutlass'
2025-05-07T20:23:31.4721354Z Entering 'external/googletest'
2025-05-07T20:23:31.4787401Z Entering 'external/hipify_torch'
2025-05-07T20:23:31.4853427Z Entering 'external/json'
2025-05-07T20:23:31.4940056Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:31.4992159Z ##[endgroup]
2025-05-07T20:23:31.4992576Z ##[group]Fetching the repository
2025-05-07T20:23:31.4999800Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:31.6983257Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:31.6984221Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:31.7009743Z ##[endgroup]
2025-05-07T20:23:31.7010336Z ##[group]Determining the checkout info
2025-05-07T20:23:31.7012173Z ##[endgroup]
2025-05-07T20:23:31.7016774Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:31.7069262Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:31.7098449Z ##[group]Checking out the ref
2025-05-07T20:23:31.7102718Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:31.7231912Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:31.7235551Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:31.7245123Z ##[endgroup]
2025-05-07T20:23:31.7245697Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:31.7250838Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:31.7300750Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:31.7332815Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:31.7364319Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:31.7393822Z ##[endgroup]
2025-05-07T20:23:31.7394376Z ##[group]Fetching submodules
2025-05-07T20:23:31.7396669Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:31.7770479Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:31.7771119Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:31.7771945Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:31.7772362Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:31.7772774Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:31.7773195Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:31.7773601Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:31.7786788Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:31.8210729Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:31.8355629Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:31.8454122Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:31.8619275Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:31.8706406Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:31.8787682Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:31.8885217Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:31.8902522Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:31.9235958Z Entering 'external/asmjit'
2025-05-07T20:23:31.9268855Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.9301308Z Entering 'external/cpuinfo'
2025-05-07T20:23:31.9333995Z Entering 'external/cutlass'
2025-05-07T20:23:31.9365982Z Entering 'external/googletest'
2025-05-07T20:23:31.9398074Z Entering 'external/hipify_torch'
2025-05-07T20:23:31.9430359Z Entering 'external/json'
2025-05-07T20:23:31.9475600Z ##[endgroup]
2025-05-07T20:23:31.9476038Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:31.9481638Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:31.9812050Z Entering 'external/asmjit'
2025-05-07T20:23:31.9854367Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9854844Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9896912Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.9939703Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9940053Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9990226Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.0036860Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0037190Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0080564Z Entering 'external/cutlass'
2025-05-07T20:23:32.0123462Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0123796Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0174436Z Entering 'external/googletest'
2025-05-07T20:23:32.0217368Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0217732Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0260174Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.0302511Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0302846Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0343681Z Entering 'external/json'
2025-05-07T20:23:32.0385301Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0385642Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0445860Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:32.0775243Z Entering 'external/asmjit'
2025-05-07T20:23:32.0837103Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:32.0840461Z Entering 'external/composable_kernel'
2025-05-07T20:23:32.0900638Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:32.0903513Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.0964652Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:32.0968007Z Entering 'external/cutlass'
2025-05-07T20:23:32.1029263Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:32.1032328Z Entering 'external/googletest'
2025-05-07T20:23:32.1093284Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:32.1096578Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.1156495Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:32.1159678Z Entering 'external/json'
2025-05-07T20:23:32.1219398Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:32.1340257Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:32.1670882Z Entering 'external/asmjit'
2025-05-07T20:23:32.1704523Z Entering 'external/composable_kernel'
2025-05-07T20:23:32.1736942Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.1768642Z Entering 'external/cutlass'
2025-05-07T20:23:32.1800544Z Entering 'external/googletest'
2025-05-07T20:23:32.1832256Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.1863951Z Entering 'external/json'
2025-05-07T20:23:32.1917819Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:32.2251700Z Entering 'external/asmjit'
2025-05-07T20:23:32.2284590Z Entering 'external/composable_kernel'
2025-05-07T20:23:32.2317498Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.2349472Z Entering 'external/cutlass'
2025-05-07T20:23:32.2381753Z Entering 'external/googletest'
2025-05-07T20:23:32.2412765Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.2444665Z Entering 'external/json'
2025-05-07T20:23:32.2487683Z ##[endgroup]
2025-05-07T20:23:32.2528765Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:32.2555319Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
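For reference, the checkout step above boils down to three git settings plus a shallow fetch: an HTTP extraheader carrying the (masked) token, insteadOf rewrites that route SSH-style submodule URLs over HTTPS, and a depth-1 fetch of the PR merge ref. A minimal standalone sketch of the same sequence, assuming B64_TOKEN is a placeholder for base64("x-access-token:<GITHUB_TOKEN>"):

    # Placeholder token; actions/checkout derives the real value from GITHUB_TOKEN
    git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${B64_TOKEN}"
    git config --global --add url.https://github.com/.insteadOf git@github.com:
    # Shallow-fetch and check out the PR merge ref, exactly as logged for PR #4066
    git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
        origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
    git checkout --progress --force refs/remotes/pull/4066/merge
    git -c protocol.version=2 submodule update --init --force --depth=1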
url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-9c0298fb-5696-52b2-b592-faf612d983f7/artifacts/e48f0a27a297b17d0606bf1cfc4cb07571f0d4bdb9bce51dcfb63b95a2571c5a.zip 2025-05-07T20:23:32.6613147Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:32.7545488Z (node:197886) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead. 2025-05-07T20:23:32.7547101Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-05-07T20:23:33.0333591Z SHA256 digest of downloaded artifact is 0316113a2b3fde93fffa97b955c92dd5eef475455a84550f9225df12df45620e 2025-05-07T20:23:33.0334236Z Artifact download completed successfully. 2025-05-07T20:23:33.0334581Z Total of 1 artifact(s) downloaded 2025-05-07T20:23:33.0340610Z Download artifact has finished successfully 2025-05-07T20:23:33.0594604Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2025-05-07T20:23:33.0595030Z with: 2025-05-07T20:23:33.0595256Z driver-version: 570.133.07 2025-05-07T20:23:33.0595517Z env: 2025-05-07T20:23:33.0595751Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:33.0596072Z BUILD_ENV: build_binary 2025-05-07T20:23:33.0596331Z BUILD_TARGET: genai 2025-05-07T20:23:33.0596569Z BUILD_VARIANT: cuda 2025-05-07T20:23:33.0596818Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:33.0597091Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:33.0597335Z ##[endgroup] 2025-05-07T20:23:33.0692658Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2025-05-07T20:23:33.0693066Z with: 2025-05-07T20:23:33.0693297Z timeout_minutes: 10 2025-05-07T20:23:33.0693538Z max_attempts: 3 2025-05-07T20:23:33.0717504Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo" else # Amazon Linux 2 YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" fi sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. 
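The download step accepts the artifact only because the computed SHA256 digest matches the expected one printed above. A hypothetical manual equivalent of that integrity check (the archive filename is an assumption; the action hashes the downloaded artifact archive before extraction):

    EXPECTED=0316113a2b3fde93fffa97b955c92dd5eef475455a84550f9225df12df45620e
    # Hypothetical local name for the downloaded artifact archive
    ACTUAL=$(sha256sum artifact.zip | cut -d' ' -f1)
    if [ "${ACTUAL}" != "${EXPECTED}" ]; then
        echo "SHA256 mismatch: got ${ACTUAL}, expected ${EXPECTED}" >&2
        exit 1
    fi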
2025-05-07T20:23:33.0594604Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:33.0595030Z with:
2025-05-07T20:23:33.0595256Z   driver-version: 570.133.07
2025-05-07T20:23:33.0595517Z env:
2025-05-07T20:23:33.0595751Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:33.0596072Z   BUILD_ENV: build_binary
2025-05-07T20:23:33.0596331Z   BUILD_TARGET: genai
2025-05-07T20:23:33.0596569Z   BUILD_VARIANT: cuda
2025-05-07T20:23:33.0596818Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:33.0597091Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:33.0597335Z ##[endgroup]
2025-05-07T20:23:33.0692658Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:33.0693066Z with:
2025-05-07T20:23:33.0693297Z   timeout_minutes: 10
2025-05-07T20:23:33.0693538Z   max_attempts: 3
2025-05-07T20:23:33.0717504Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
# But is it the best way to make it so that this action relies on nothing else? Absolutely
set -eou pipefail

DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

install_nvidia_docker2_amzn2() {
    (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
            YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
            # Amazon Linux 2
            YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
    )
}

install_nvidia_docker2_ubuntu20() {
    (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
            sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
            sudo systemctl restart docker
        fi
    )
}

pre_install_nvidia_driver_amzn2() {
    (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
    )
}

install_nvidia_driver_common() {
    (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
            set +e
            # The driver exists, check its version next. Also check only the first GPU
            # if there are more than one of them so that the same driver version is
            # not printed over multiple lines
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
                echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
            elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
                echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
                # Turn off persistent mode so that the installation script can unload the kernel module
                sudo killall nvidia-persistenced || true
            else
                HAS_NVIDIA_DRIVER=1
                echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
            fi
            set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
            # CAUTION: this may need to be updated in future
            if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
                sudo yum groupinstall -y "Development Tools"
                # ensure our kernel install is the same as our underlying kernel,
                # groupinstall "Development Tools" has a habit of mismatching kernel headers
                sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
                sudo modprobe backlight
            fi
            sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

            set +e
            sudo /bin/bash /tmp/nvidia_driver -s --no-drm
            NVIDIA_INSTALLATION_STATUS=$?

            RESET_GPU=0
            if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
                sudo cat /var/log/nvidia-installer.log
                # Failed to install NVIDIA driver, try to reset the GPU
                RESET_GPU=1
            elif [ -x "$(command -v nvidia-smi)" ]; then
                # Check again if nvidia-smi works even if the driver installation completes successfully
                INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
                NVIDIA_SMI_STATUS=$?
                if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
                    RESET_GPU=1
                fi
            fi

            if [ "$RESET_GPU" -eq 1 ]; then
                NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
                # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
                # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
                for PCI_ID in $NVIDIA_DEVICES; do
                    DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
                    echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
                    # This requires sudo permission of course
                    echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
                    sleep 1
                done
            fi

            sudo rm -fv /tmp/nvidia_driver
            set -e
        fi
    )
}

post_install_nvidia_driver_common() {
    (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        (
            set +e
            nvidia-smi
            # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
            # the case where the driver has already crashed as it still can get the driver version
            # and some basic information like the bus ID. However, the rest of the information
            # would be missing (ERR!), for example:
            #
            # +-----------------------------------------------------------------------------+
            # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
            # |-------------------------------+----------------------+----------------------+
            # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
            # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
            # |                               |                      |               MIG M. |
            # |===============================+======================+======================|
            # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
            # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
            # |                               |                      |                 ERR! |
            # +-------------------------------+----------------------+----------------------+
            #
            # +-----------------------------------------------------------------------------+
            # | Processes:                                                                  |
            # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
            # |        ID   ID                                                   Usage      |
            # |=============================================================================|
            # +-----------------------------------------------------------------------------+
            #
            # This should be reported as a failure instead as it is guaranteed to fail when
            # Docker tries to run with --gpus all
            #
            # So, the correct check here is to query one of the missing pieces of info like
            # GPU name, so that the command can fail accordingly
            nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
            NVIDIA_SMI_STATUS=$?

            # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
            if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
                echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
            else
                echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
                exit ${NVIDIA_SMI_STATUS}
            fi
            set -e
        )
    )
}

install_nvidia_driver_amzn2() {
    (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
    )
}

install_nvidia_driver_ubuntu20() {
    (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
    )
}

echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
    amzn*)
        install_nvidia_driver_amzn2
        ;;
    ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
    *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
esac

# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
    amzn*)
        install_nvidia_docker2_amzn2
        ;;
    ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
    *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
esac

echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

# Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
# more than one GPU. This just needs to be run once. The command fails
# on subsequent runs and complains that the mode is already on, but that's
# ok
sudo nvidia-persistenced || true

# This should show persistence mode ON
nvidia-smi
2025-05-07T20:23:33.0741654Z   retry_wait_seconds: 10
2025-05-07T20:23:33.0741924Z   polling_interval_seconds: 1
2025-05-07T20:23:33.0742198Z   warning_on_retry: true
2025-05-07T20:23:33.0742457Z   continue_on_error: false
2025-05-07T20:23:33.0742713Z env:
2025-05-07T20:23:33.0742937Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:33.0743255Z   BUILD_ENV: build_binary
2025-05-07T20:23:33.0762258Z   BUILD_TARGET: genai
2025-05-07T20:23:33.0762534Z   BUILD_VARIANT: cuda
2025-05-07T20:23:33.0762780Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:33.0763050Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:33.0763296Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:33.0763539Z ##[endgroup]
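Before the execution trace below, note the install-skip logic at the heart of install_nvidia_driver_common: it queries the driver version of GPU 0 only and installs nothing when it matches the requested version. A condensed sketch of that path, using the same commands as the script above (error handling elided; DRIVER_VERSION comes from the step's env):

    DRIVER_VERSION=570.133.07
    set +e
    INSTALLED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
    set -e
    if [ "${INSTALLED}" = "${DRIVER_VERSION}" ]; then
        # Matches the requested version, so the full driver install is skipped
        echo "NVIDIA driver (${INSTALLED}) has already been installed. Skipping"
    fi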
2025-05-07T20:23:33.9599979Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:33.9600682Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:33.9603258Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:34.2576835Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:34.2577449Z No packages marked for removal.
2025-05-07T20:23:34.2647510Z Dependencies resolved.
2025-05-07T20:23:34.2658010Z Nothing to do.
2025-05-07T20:23:34.2658385Z Complete!
2025-05-07T20:23:34.3010205Z + install_nvidia_driver_common
2025-05-07T20:23:34.3014297Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:34.3014610Z + lspci
2025-05-07T20:23:34.3016308Z Before installing NVIDIA driver
2025-05-07T20:23:34.3134711Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:34.3135765Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:34.3136347Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:34.3136879Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:34.3137368Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:34.3137903Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:34.3138389Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:34.3138915Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:34.3139346Z + lsmod
2025-05-07T20:23:34.3183027Z Module                  Size  Used by
2025-05-07T20:23:34.3183634Z xt_nat                 16384  0
2025-05-07T20:23:34.3184157Z nvidia_modeset       1716224  0
2025-05-07T20:23:34.3184726Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:34.3185337Z wmi                    36864  1 video
2025-05-07T20:23:34.3185886Z nvidia_uvm           1884160  0
2025-05-07T20:23:34.3186499Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:34.3187153Z drm                   602112  1 nvidia
2025-05-07T20:23:34.3187763Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:34.3188499Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:34.3188982Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:34.3189295Z veth                   36864  0
2025-05-07T20:23:34.3189562Z xt_conntrack           16384  1
2025-05-07T20:23:34.3189827Z nft_chain_nat          16384  3
2025-05-07T20:23:34.3190089Z xt_MASQUERADE          20480  1
2025-05-07T20:23:34.3190794Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:34.3191151Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:34.3191592Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:34.3192059Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:34.3192382Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:34.3192684Z xfrm_user              57344  1
2025-05-07T20:23:34.3192953Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:34.3193254Z xt_addrtype            16384  2
2025-05-07T20:23:34.3193524Z nft_compat             20480  4
2025-05-07T20:23:34.3193835Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:34.3194260Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:34.3194641Z br_netfilter           36864  0
2025-05-07T20:23:34.3194922Z bridge                323584  1 br_netfilter
2025-05-07T20:23:34.3195234Z stp                    16384  1 bridge
2025-05-07T20:23:34.3195526Z llc                    16384  2 bridge,stp
2025-05-07T20:23:34.3195818Z overlay               167936  0
2025-05-07T20:23:34.3196070Z tls                   135168  0
2025-05-07T20:23:34.3196333Z nls_ascii              16384  1
2025-05-07T20:23:34.3196592Z nls_cp437              20480  1
2025-05-07T20:23:34.3196841Z vfat                   24576  1
2025-05-07T20:23:34.3197099Z fat                    86016  1 vfat
2025-05-07T20:23:34.3197370Z ena                   180224  0
2025-05-07T20:23:34.3197615Z i8042                  45056  0
2025-05-07T20:23:34.3197874Z serio                  28672  3 i8042
2025-05-07T20:23:34.3198161Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:34.3198429Z button                 24576  0
2025-05-07T20:23:34.3198685Z sunrpc                696320  1
2025-05-07T20:23:34.3198948Z sch_fq_codel           20480  17
2025-05-07T20:23:34.3199206Z dm_mod                188416  0
2025-05-07T20:23:34.3199473Z dax                    45056  1 dm_mod
2025-05-07T20:23:34.3199760Z fuse                  163840  1
2025-05-07T20:23:34.3200014Z loop                   36864  0
2025-05-07T20:23:34.3200459Z configfs               57344  1
2025-05-07T20:23:34.3200746Z dmi_sysfs              20480  0
2025-05-07T20:23:34.3201114Z crc32_pclmul           16384  0
2025-05-07T20:23:34.3201378Z crc32c_intel           24576  0
2025-05-07T20:23:34.3201633Z efivarfs               24576  1
2025-05-07T20:23:34.3201906Z + modinfo nvidia
2025-05-07T20:23:34.3202304Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:34.3202758Z import_ns: DMA_BUF
2025-05-07T20:23:34.3203012Z alias: char-major-195-*
2025-05-07T20:23:34.3203292Z version: 570.133.07
2025-05-07T20:23:34.3203552Z supported: external
2025-05-07T20:23:34.3203803Z license: Dual MIT/GPL
2025-05-07T20:23:34.3204100Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:34.3204458Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:34.3204784Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:34.3205123Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:34.3205474Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:34.3205818Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:34.3206143Z depends: i2c-core,drm
2025-05-07T20:23:34.3206412Z retpoline: Y
2025-05-07T20:23:34.3206631Z name: nvidia
2025-05-07T20:23:34.3207004Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:34.3207491Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:34.3207955Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:34.3208379Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:34.3208701Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:34.3209059Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:34.3209481Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:34.3209800Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:34.3210113Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:34.3210485Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:34.3210883Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:34.3211226Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:34.3211537Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:34.3211849Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:34.3212221Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:34.3212631Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:34.3213016Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:34.3213617Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.3214042Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:34.3214481Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.3214906Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:34.3215258Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:34.3215653Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:34.3216037Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:34.3216395Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:34.3216733Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:34.3217071Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:34.3217410Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:34.3217736Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:34.3218090Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:34.3218470Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:34.3218814Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:34.3219163Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:34.3219521Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:34.3219875Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:34.3220231Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:34.3220772Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:34.3221080Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:34.3221421Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:34.3221754Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:34.3222085Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:34.3222431Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:34.3222795Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:34.3223160Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:34.3223499Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:34.3223875Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:34.3224236Z parm: rm_firmware_active:charp
2025-05-07T20:23:34.3224547Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:34.3224794Z ++ command -v nvidia-smi
2025-05-07T20:23:34.3225066Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:34.3225336Z + set +e
2025-05-07T20:23:34.3225662Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:34.3449818Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:34.3450137Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:34.3450386Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:34.3450609Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:34.3450888Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:34.3451329Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:34.3451808Z + set -e
2025-05-07T20:23:34.3452015Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:34.3452416Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:34.3453246Z + post_install_nvidia_driver_common
2025-05-07T20:23:34.3457360Z + sudo modprobe nvidia
2025-05-07T20:23:34.4614168Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:34.4614503Z + lspci
2025-05-07T20:23:34.4614732Z After installing NVIDIA driver
2025-05-07T20:23:34.4727143Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:34.4727754Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:34.4728321Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:34.4728858Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:34.4729350Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:34.4729891Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:34.4730394Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:34.4730883Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:34.4731309Z + lsmod
2025-05-07T20:23:34.4759620Z Module                  Size  Used by
2025-05-07T20:23:34.4759920Z xt_nat                 16384  0
2025-05-07T20:23:34.4760273Z nvidia_modeset       1716224  0
2025-05-07T20:23:34.4760575Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:34.4760888Z wmi                    36864  1 video
2025-05-07T20:23:34.4761163Z nvidia_uvm           1884160  0
2025-05-07T20:23:34.4761477Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:34.4761815Z drm                   602112  1 nvidia
2025-05-07T20:23:34.4762126Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:34.4762521Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:34.4762883Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:34.4763182Z veth                   36864  0
2025-05-07T20:23:34.4763447Z xt_conntrack           16384  1
2025-05-07T20:23:34.4763716Z nft_chain_nat          16384  3
2025-05-07T20:23:34.4763992Z xt_MASQUERADE          20480  1
2025-05-07T20:23:34.4764309Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:34.4764669Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:34.4765392Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:34.4765876Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:34.4766197Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:34.4766506Z xfrm_user              57344  1
2025-05-07T20:23:34.4766792Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:34.4767090Z xt_addrtype            16384  2
2025-05-07T20:23:34.4767363Z nft_compat             20480  4
2025-05-07T20:23:34.4767682Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:34.4768111Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:34.4768505Z br_netfilter           36864  0
2025-05-07T20:23:34.4768799Z bridge                323584  1 br_netfilter
2025-05-07T20:23:34.4769110Z stp                    16384  1 bridge
2025-05-07T20:23:34.4769406Z llc                    16384  2 bridge,stp
2025-05-07T20:23:34.4769710Z overlay               167936  0
2025-05-07T20:23:34.4769985Z tls                   135168  0
2025-05-07T20:23:34.4770245Z nls_ascii              16384  1
2025-05-07T20:23:34.4770515Z nls_cp437              20480  1
2025-05-07T20:23:34.4770778Z vfat                   24576  1
2025-05-07T20:23:34.4771040Z fat                    86016  1 vfat
2025-05-07T20:23:34.4771320Z ena                   180224  0
2025-05-07T20:23:34.4771582Z i8042                  45056  0
2025-05-07T20:23:34.4771845Z serio                  28672  3 i8042
2025-05-07T20:23:34.4772140Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:34.4772415Z button                 24576  0
2025-05-07T20:23:34.4772671Z sunrpc                696320  1
2025-05-07T20:23:34.4772936Z sch_fq_codel           20480  17
2025-05-07T20:23:34.4773211Z dm_mod                188416  0
2025-05-07T20:23:34.4773628Z dax                    45056  1 dm_mod
2025-05-07T20:23:34.4773907Z fuse                  163840  1
2025-05-07T20:23:34.4774168Z loop                   36864  0
2025-05-07T20:23:34.4774432Z configfs               57344  1
2025-05-07T20:23:34.4774693Z dmi_sysfs              20480  0
2025-05-07T20:23:34.4774956Z crc32_pclmul           16384  0
2025-05-07T20:23:34.4775222Z crc32c_intel           24576  0
2025-05-07T20:23:34.4775481Z efivarfs               24576  1
2025-05-07T20:23:34.4775738Z + modinfo nvidia
2025-05-07T20:23:34.4778108Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:34.4778594Z import_ns: DMA_BUF
2025-05-07T20:23:34.4778875Z alias: char-major-195-*
2025-05-07T20:23:34.4779177Z version: 570.133.07
2025-05-07T20:23:34.4779436Z supported: external
2025-05-07T20:23:34.4779692Z license: Dual MIT/GPL
2025-05-07T20:23:34.4779995Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:34.4780356Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:34.4780681Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:34.4781013Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:34.4781368Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:34.4781718Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:34.4782043Z depends: i2c-core,drm
2025-05-07T20:23:34.4782310Z retpoline: Y
2025-05-07T20:23:34.4782536Z name: nvidia
2025-05-07T20:23:34.4782905Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:34.4783399Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:34.4783869Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:34.4784298Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:34.4784624Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:34.4784946Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:34.4785268Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:34.4785589Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:34.4785914Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:34.4786398Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:34.4786797Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:34.4787146Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:34.4787461Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:34.4787775Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:34.4788154Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:34.4788565Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:34.4788954Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:34.4789381Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.4789802Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:34.4790240Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.4790663Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:34.4791016Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:34.4791400Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:34.4791785Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:34.4792138Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:34.4792475Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:34.4792813Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:34.4793149Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:34.4793476Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:34.4793842Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:34.4794210Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:34.4794552Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:34.4795000Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:34.4795351Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:34.4795705Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:34.4796063Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:34.4796401Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:34.4796705Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:34.4797043Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:34.4797377Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:34.4797706Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:34.4798049Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:34.4798420Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:34.4798806Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:34.4799171Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:34.4799531Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:34.4799885Z parm: rm_firmware_active:charp
2025-05-07T20:23:34.4800288Z + set +e
2025-05-07T20:23:34.4800495Z + nvidia-smi
2025-05-07T20:23:34.4953620Z Wed May  7 20:23:34 2025
2025-05-07T20:23:34.4954007Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:34.4954519Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:34.4955023Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:34.4955534Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:34.4956084Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:34.4956526Z |                                         |                        |               MIG M. |
2025-05-07T20:23:34.4956879Z |=========================================+========================+======================|
2025-05-07T20:23:34.5125282Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:34.5125933Z |  0%   29C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:34.5126338Z |                                         |                        |                  N/A |
2025-05-07T20:23:34.5126751Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:34.5128863Z
2025-05-07T20:23:34.5129283Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:34.5129724Z | Processes:                                                                              |
2025-05-07T20:23:34.5130185Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:34.5130620Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:34.5130980Z |=========================================================================================|
2025-05-07T20:23:34.5132781Z |  No running processes found                                                             |
2025-05-07T20:23:34.5133274Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:34.7488861Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:34.7652307Z NVIDIA A10G
2025-05-07T20:23:34.7698437Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:34.7698692Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:34.7698933Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:34.7699229Z + set -e
2025-05-07T20:23:34.7699449Z INFO: Ignoring allowed status 0
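The check that just passed is the post-install health probe from the script above: nvidia-smi can exit 0 while printing ERR! for most fields after a driver crash, so the probe queries a field that disappears in that state (the GPU name) and whitelists exit statuses 0 and 14. A condensed sketch:

    set +e
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    STATUS=$?
    set -e
    # 0 and 14 are allowable statuses, per https://github.com/NVIDIA/gpu-operator/issues/285
    if [ "${STATUS}" -ne 0 ] && [ "${STATUS}" -ne 14 ]; then
        echo "ERROR: nvidia-smi exited with unresolved status ${STATUS}"
        exit "${STATUS}"
    fi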
2025-05-07T20:23:34.7705811Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:34.7709382Z + sudo yum install -y yum-utils
2025-05-07T20:23:35.1894454Z Last metadata expiration check: 0:54:02 ago on Wed May  7 19:29:33 2025.
2025-05-07T20:23:35.2139993Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:35.2538847Z Dependencies resolved.
2025-05-07T20:23:35.2720924Z Nothing to do.
2025-05-07T20:23:35.2721160Z Complete!
2025-05-07T20:23:35.3116509Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:35.3117088Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:35.3117960Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:35.5921228Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:35.6491867Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:36.1888307Z nvidia-container-toolkit                         12 kB/s | 833  B     00:00
2025-05-07T20:23:36.2136622Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:36.2535968Z Dependencies resolved.
2025-05-07T20:23:36.2713876Z ================================================================================
2025-05-07T20:23:36.2714787Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:36.2715568Z ================================================================================
2025-05-07T20:23:36.2716182Z Downgrading:
2025-05-07T20:23:36.2716931Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:36.2718132Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:36.2718849Z
2025-05-07T20:23:36.2719047Z Transaction Summary
2025-05-07T20:23:36.2719425Z ================================================================================
2025-05-07T20:23:36.2719837Z Downgrade  2 Packages
2025-05-07T20:23:36.2720063Z
2025-05-07T20:23:36.2720282Z Total download size: 6.8 M
2025-05-07T20:23:36.2720662Z Downloading Packages:
2025-05-07T20:23:36.3115276Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  32 MB/s | 1.2 MB     00:00
2025-05-07T20:23:36.3800584Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  52 MB/s | 5.6 MB     00:00
2025-05-07T20:23:36.3812593Z --------------------------------------------------------------------------------
2025-05-07T20:23:36.3815965Z Total                                            62 MB/s | 6.8 MB     00:00
2025-05-07T20:23:36.3818391Z Running transaction check
2025-05-07T20:23:36.3919673Z Transaction check succeeded.
2025-05-07T20:23:36.3919966Z Running transaction test
2025-05-07T20:23:36.4212753Z Transaction test succeeded.
2025-05-07T20:23:36.4215650Z Running transaction
2025-05-07T20:23:36.9735612Z   Preparing        :                                                      1/1
2025-05-07T20:23:37.0789397Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64        1/4
2025-05-07T20:23:37.0808207Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64             2/4
2025-05-07T20:23:37.1020072Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64             2/4
2025-05-07T20:23:37.1020657Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64             3/4
2025-05-07T20:23:37.1124422Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64             3/4
2025-05-07T20:23:37.1147993Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64        4/4
2025-05-07T20:23:37.2893902Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64             4/4
2025-05-07T20:23:37.2894495Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64             1/4
2025-05-07T20:23:37.2895046Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64             2/4
2025-05-07T20:23:37.2895592Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64        3/4
2025-05-07T20:23:37.4340179Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64        4/4
================================================================================
2025-05-07T20:23:37.4340837Z WARNING:
2025-05-07T20:23:37.4341181Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:37.4341441Z
2025-05-07T20:23:37.4341538Z   Available Versions:
2025-05-07T20:23:37.4341706Z
2025-05-07T20:23:37.4341799Z   Version 2023.7.20250331:
2025-05-07T20:23:37.4342134Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:37.4342399Z
2025-05-07T20:23:37.4342533Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:37.4342755Z
2025-05-07T20:23:37.4342845Z     Release notes:
2025-05-07T20:23:37.4343278Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:37.4343666Z
2025-05-07T20:23:37.4343766Z   Version 2023.7.20250414:
2025-05-07T20:23:37.4344087Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:37.4344369Z
2025-05-07T20:23:37.4344491Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:37.4344716Z
2025-05-07T20:23:37.4344805Z     Release notes:
2025-05-07T20:23:37.4345228Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:37.4345611Z
2025-05-07T20:23:37.4345705Z   Version 2023.7.20250428:
2025-05-07T20:23:37.4346029Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:37.4346296Z
2025-05-07T20:23:37.4346420Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:37.4346638Z
2025-05-07T20:23:37.4346732Z     Release notes:
2025-05-07T20:23:37.4347140Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:37.4347528Z
2025-05-07T20:23:37.4347647Z ================================================================================
2025-05-07T20:23:37.4697988Z
2025-05-07T20:23:37.4698140Z
2025-05-07T20:23:37.4698235Z Downgraded:
2025-05-07T20:23:37.4698630Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:37.4699220Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:37.4699570Z
2025-05-07T20:23:37.4699662Z Complete!
2025-05-07T20:23:37.5184624Z + sudo systemctl restart docker
2025-05-07T20:23:40.5879570Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:40.6075029Z Wed May  7 20:23:40 2025
2025-05-07T20:23:40.6075420Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:40.6075945Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:40.6076448Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:40.6076960Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:40.6077498Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:40.6077970Z |                                         |                        |               MIG M. |
2025-05-07T20:23:40.6078334Z |=========================================+========================+======================|
2025-05-07T20:23:40.6207853Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:40.6208314Z |  0%   30C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:40.6208709Z |                                         |                        |                  N/A |
2025-05-07T20:23:40.6209118Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:40.6212162Z
2025-05-07T20:23:40.6212578Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:40.6213749Z | Processes:                                                                              |
2025-05-07T20:23:40.6214208Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:40.6214639Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:40.6214991Z |=========================================================================================|
2025-05-07T20:23:40.6217596Z |  No running processes found                                                             |
2025-05-07T20:23:40.6218086Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:41.1270235Z Command completed after 1 attempt(s).
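The setup step also exported GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all into GITHUB_ENV, visible in the environment dump below. A hypothetical later step would splice it into its docker invocation along these lines (the image name is a placeholder):

    # GPU_FLAG is intentionally left unquoted so it word-splits into two docker options
    docker run ${GPU_FLAG} --rm some-cuda-image nvidia-smi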
2025-05-07T20:23:41.1357535Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:41.1358012Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:41.1372040Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:41.1372420Z env:
2025-05-07T20:23:41.1372659Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:41.1372969Z   BUILD_ENV: build_binary
2025-05-07T20:23:41.1373228Z   BUILD_TARGET: genai
2025-05-07T20:23:41.1373472Z   BUILD_VARIANT: cuda
2025-05-07T20:23:41.1373714Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:41.1373984Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:41.1374303Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:41.1374648Z ##[endgroup]
2025-05-07T20:23:41.4726716Z ################################################################################
2025-05-07T20:23:41.4727109Z # Print System Info
2025-05-07T20:23:41.4727339Z #
2025-05-07T20:23:41.4741488Z # [2025-05-07T20:23:41.473Z] + print_system_info
2025-05-07T20:23:41.4741867Z ################################################################################
2025-05-07T20:23:41.4742102Z
2025-05-07T20:23:41.4742217Z ################################################################################
2025-05-07T20:23:41.4742576Z [INFO] Printing environment variables ...
2025-05-07T20:23:41.4742874Z + printenv
2025-05-07T20:23:41.4743000Z
2025-05-07T20:23:41.4751695Z SHELL=/bin/bash
2025-05-07T20:23:41.4752051Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:41.4752463Z BUILD_VARIANT=cuda
2025-05-07T20:23:41.4753099Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4753772Z GITHUB_ACTION=__run
2025-05-07T20:23:41.4754070Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:41.4754421Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:41.4754675Z RUNNER_NAME=i-04dd41b83603cbddd
2025-05-07T20:23:41.4754973Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:41.4755286Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:41.4755555Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:41.4755934Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:41.4756380Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:41.4756660Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:41.4756963Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:41.4757575Z ***
2025-05-07T20:23:41.4757776Z LOGNAME=ec2-user
2025-05-07T20:23:41.4758010Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:41.4758279Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:41.4758516Z GITHUB_ACTIONS=true
2025-05-07T20:23:41.4758741Z SYSTEMD_EXEC_PID=55419
2025-05-07T20:23:41.4759029Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:41.4759591Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:41.4760233Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:41.4760526Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:41.4760794Z RUNNER_OS=Linux
2025-05-07T20:23:41.4761019Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:41.4761276Z HOME=/home/ec2-user
2025-05-07T20:23:41.4761536Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:41.4762174Z LANG=C.UTF-8
2025-05-07T20:23:41.4762471Z RUNNER_TRACKING_ID=github_d537b2d4-b72f-4240-a0b6-544aab4d7466
2025-05-07T20:23:41.4762841Z RUNNER_ARCH=X64
2025-05-07T20:23:41.4763124Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:41.4763456Z BUILD_TARGET=genai
2025-05-07T20:23:41.4763999Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4764886Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4765635Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:41.4766324Z INVOCATION_ID=1a5cf068cbd9400aa048f8bfcc0aff7d
2025-05-07T20:23:41.4766665Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:41.4766940Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:41.4767533Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4768171Z BUILD_ENV=build_binary
2025-05-07T20:23:41.4768417Z GITHUB_ACTOR=q10
2025-05-07T20:23:41.4768636Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:41.4768870Z KERN_NAME_LC=linux
2025-05-07T20:23:41.4769104Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:41.4769412Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:41.4769762Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:41.4770017Z USER=ec2-user
2025-05-07T20:23:41.4770253Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:41.4770540Z SHLVL=1 2025-05-07T20:23:41.4770756Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:41.4771110Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:41.4771568Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:41.4771942Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:41.4772191Z KERN_NAME=Linux 2025-05-07T20:23:41.4772424Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:41.4772853Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:41.4773298Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:41.4773578Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:41.4773834Z JOURNAL_STREAM=8:92359 2025-05-07T20:23:41.4774157Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:41.4774527Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:41.4774844Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:41.4775188Z GITHUB_BASE_REF=main 2025-05-07T20:23:41.4775405Z CI=true 2025-05-07T20:23:41.4775621Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:41.4775914Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:41.4776200Z GITHUB_ACTION_REF= 2025-05-07T20:23:41.4776451Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:41.4777084Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_0aae4d04-12f7-4bb7-9b39-789ed9ac7062 2025-05-07T20:23:41.4777688Z MACHINE_NAME=x86_64 2025-05-07T20:23:41.4777914Z _=/usr/bin/printenv 2025-05-07T20:23:41.4778061Z 2025-05-07T20:23:41.4778191Z ################################################################################ 2025-05-07T20:23:41.4778517Z [INFO] Print ldd version ... 2025-05-07T20:23:41.4778780Z + ldd --version 2025-05-07T20:23:41.4778910Z 2025-05-07T20:23:41.4778998Z ldd (GNU libc) 2.34 2025-05-07T20:23:41.4779274Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:41.4779727Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:41.4780267Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:41.4780735Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:41.4780962Z 2025-05-07T20:23:41.4781090Z ################################################################################ 2025-05-07T20:23:41.4781404Z [INFO] Print CPU info ... 
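The glibc version that ldd reports (2.34 here) is what bounds manylinux wheel compatibility for binaries built on this runner. A one-line extraction of just the version number, shown as an illustration rather than a step of this workflow:

  glibc_version=$(ldd --version | head -1 | awk '{print $NF}')
  echo "glibc ${glibc_version}"   # -> "glibc 2.34" on this runner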
2025-05-07T20:23:41.4781090Z ################################################################################
2025-05-07T20:23:41.4781404Z [INFO] Print CPU info ...
2025-05-07T20:23:41.4781650Z + nproc
2025-05-07T20:23:41.4781770Z
2025-05-07T20:23:41.4783623Z 16
2025-05-07T20:23:41.4785290Z
2025-05-07T20:23:41.4785509Z + lscpu
2025-05-07T20:23:41.4785623Z
2025-05-07T20:23:41.4858427Z Architecture:             x86_64
2025-05-07T20:23:41.4858978Z CPU op-mode(s):           32-bit, 64-bit
2025-05-07T20:23:41.4859520Z Address sizes:            48 bits physical, 48 bits virtual
2025-05-07T20:23:41.4859932Z Byte Order:               Little Endian
2025-05-07T20:23:41.4860320Z CPU(s):                   16
2025-05-07T20:23:41.4860760Z On-line CPU(s) list:      0-15
2025-05-07T20:23:41.4861223Z Vendor ID:                AuthenticAMD
2025-05-07T20:23:41.4861704Z Model name:               AMD EPYC 7R32
2025-05-07T20:23:41.4862037Z CPU family:               23
2025-05-07T20:23:41.4862566Z Model:                    49
2025-05-07T20:23:41.4862878Z Thread(s) per core:       2
2025-05-07T20:23:41.4863183Z Core(s) per socket:       8
2025-05-07T20:23:41.4863495Z Socket(s):                1
2025-05-07T20:23:41.4863793Z Stepping:                 0
2025-05-07T20:23:41.4864107Z BogoMIPS:                 5599.99
2025-05-07T20:23:41.4866298Z Flags:                    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:41.4868482Z Hypervisor vendor:        KVM
2025-05-07T20:23:41.4868813Z Virtualization type:      full
2025-05-07T20:23:41.4869183Z L1d cache:                256 KiB (8 instances)
2025-05-07T20:23:41.4869569Z L1i cache:                256 KiB (8 instances)
2025-05-07T20:23:41.4869956Z L2 cache:                 4 MiB (8 instances)
2025-05-07T20:23:41.4870337Z L3 cache:                 32 MiB (2 instances)
2025-05-07T20:23:41.4870685Z NUMA node(s):             1
2025-05-07T20:23:41.4871028Z NUMA node0 CPU(s):        0-15
2025-05-07T20:23:41.4871403Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:23:41.4871846Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:23:41.4872393Z Vulnerability L1tf:                   Not affected
2025-05-07T20:23:41.4872924Z Vulnerability Mds:                    Not affected
2025-05-07T20:23:41.4873452Z Vulnerability Meltdown:               Not affected
2025-05-07T20:23:41.4873964Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:23:41.4874496Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:23:41.4875292Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:23:41.4876132Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:23:41.4876734Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:23:41.4877602Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:23:41.4878493Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:23:41.4879199Z Vulnerability Srbds:                  Not affected
2025-05-07T20:23:41.4879581Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:23:41.4879922Z
2025-05-07T20:23:41.4880018Z + cat /proc/cpuinfo
2025-05-07T20:23:41.4880256Z
2025-05-07T20:23:41.4880354Z processor : 0
2025-05-07T20:23:41.4880580Z vendor_id : AuthenticAMD
2025-05-07T20:23:41.4881001Z cpu family : 23
2025-05-07T20:23:41.4881221Z model : 49
2025-05-07T20:23:41.4881437Z model name : AMD EPYC 7R32
2025-05-07T20:23:41.4881699Z stepping : 0
2025-05-07T20:23:41.4881924Z microcode : 0x830107f
2025-05-07T20:23:41.4882157Z cpu MHz : 3302.129
2025-05-07T20:23:41.4882384Z cache size : 512 KB
2025-05-07T20:23:41.4882610Z physical id : 0
2025-05-07T20:23:41.4882825Z siblings : 16
2025-05-07T20:23:41.4883037Z core id : 0
2025-05-07T20:23:41.4883246Z cpu cores : 8
2025-05-07T20:23:41.4883459Z apicid : 0
2025-05-07T20:23:41.4883664Z initial apicid : 0
2025-05-07T20:23:41.4883885Z fpu : yes
2025-05-07T20:23:41.4884095Z fpu_exception : yes
2025-05-07T20:23:41.4884317Z cpuid level : 13
2025-05-07T20:23:41.4884536Z wp : yes
2025-05-07T20:23:41.4886674Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:41.4889000Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:41.4889507Z bogomips : 5599.99
2025-05-07T20:23:41.4889741Z TLB size : 3072 4K pages
2025-05-07T20:23:41.4889989Z clflush size : 64
2025-05-07T20:23:41.4890216Z cache_alignment : 64
2025-05-07T20:23:41.4890499Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:41.4890835Z power management:
2025-05-07T20:23:41.4890974Z
[... /proc/cpuinfo blocks for processors 1-15 omitted: they repeat the processor 0 block verbatim except for the processor, core id, apicid, initial apicid, and cpu MHz fields ...]
2025-05-07T20:23:41.5109383Z
2025-05-07T20:23:41.5109387Z
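For sizing build and test parallelism, the useful numbers in the output above are the logical CPU count and the SMT factor. A small sketch that condenses them from the same nproc/lscpu calls -- an illustration, not a step of setup_env.bash:

  jobs=$(nproc)                                              # 16 logical CPUs on this runner
  sockets=$(lscpu | awk -F: '/^Socket\(s\)/ { gsub(/ /, "", $2); print $2 }')
  smt=$(lscpu | awk -F: '/^Thread\(s\) per core/ { gsub(/ /, "", $2); print $2 }')
  echo "parallel jobs: $((jobs > 2 ? jobs - 2 : 1)) (sockets=${sockets}, threads/core=${smt})"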
2025-05-07T20:23:41.5109520Z ################################################################################
2025-05-07T20:23:41.5109846Z [INFO] Print PCI info ...
2025-05-07T20:23:41.5110096Z + lspci -v
2025-05-07T20:23:41.5110222Z
2025-05-07T20:23:41.5110449Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:41.5110851Z     Subsystem: Amazon.com, Inc. Device 1237
2025-05-07T20:23:41.5111183Z     Flags: bus master, medium devsel, latency 0
2025-05-07T20:23:41.5111399Z
2025-05-07T20:23:41.5111603Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:41.5111998Z     Physical Slot: 1
2025-05-07T20:23:41.5112250Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5112460Z
2025-05-07T20:23:41.5112720Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:41.5113160Z     Physical Slot: 1
2025-05-07T20:23:41.5113747Z     Flags: bus master, fast devsel, latency 0, IRQ 9
2025-05-07T20:23:41.5113998Z
2025-05-07T20:23:41.5114283Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller])
2025-05-07T20:23:41.5114737Z     Physical Slot: 3
2025-05-07T20:23:41.5114993Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5115521Z     Memory at c1000000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:41.5115884Z     Expansion ROM at 000c0000 [disabled] [size=128K]
2025-05-07T20:23:41.5116121Z
2025-05-07T20:23:41.5116428Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:41.5116950Z     Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:41.5117245Z     Physical Slot: 4
2025-05-07T20:23:41.5117504Z     Flags: bus master, fast devsel, latency 0, IRQ 11
2025-05-07T20:23:41.5117897Z     Memory at c1808000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:41.5118270Z     Capabilities: <access denied>
2025-05-07T20:23:41.5118557Z     Kernel driver in use: nvme
2025-05-07T20:23:41.5118731Z
2025-05-07T20:23:41.5119106Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.5119605Z     Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.5119960Z     Physical Slot: 5
2025-05-07T20:23:41.5120281Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5120659Z     Memory at c1804000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:41.5121051Z     Memory at c1400000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:41.5121384Z     Capabilities: <access denied>
2025-05-07T20:23:41.5121660Z     Kernel driver in use: ena
2025-05-07T20:23:41.5121909Z     Kernel modules: ena
2025-05-07T20:23:41.5122053Z
2025-05-07T20:23:41.5122227Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:41.5122619Z     Subsystem: NVIDIA Corporation Device 152f
2025-05-07T20:23:41.5122924Z     Physical Slot: 30
2025-05-07T20:23:41.5123185Z     Flags: bus master, fast devsel, latency 0, IRQ 10
2025-05-07T20:23:41.5123575Z     Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
2025-05-07T20:23:41.5123985Z     Memory at 1800000000 (64-bit, prefetchable) [size=32G]
2025-05-07T20:23:41.5124369Z     Memory at 1040000000 (64-bit, prefetchable) [size=32M]
2025-05-07T20:23:41.5124706Z     Capabilities: <access denied>
2025-05-07T20:23:41.5124989Z     Kernel driver in use: nvidia
2025-05-07T20:23:41.5125258Z     Kernel modules: nvidia
2025-05-07T20:23:41.5125408Z
2025-05-07T20:23:41.5125718Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:41.5126245Z     Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:41.5126547Z     Physical Slot: 31
2025-05-07T20:23:41.5126794Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5127160Z     Memory at c1800000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:41.5127556Z     Memory at c180c000 (32-bit, prefetchable) [size=8K]
2025-05-07T20:23:41.5127898Z     Capabilities: <access denied>
2025-05-07T20:23:41.5128162Z     Kernel driver in use: nvme
2025-05-07T20:23:41.5128337Z
2025-05-07T20:23:41.5128341Z
2025-05-07T20:23:41.5128459Z ################################################################################
2025-05-07T20:23:41.5128799Z [INFO] Print Linux distribution info ...
2025-05-07T20:23:41.5129095Z + uname -a
2025-05-07T20:23:41.5129226Z
2025-05-07T20:23:41.5129643Z Linux ip-10-0-8-106.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
2025-05-07T20:23:41.5130158Z
2025-05-07T20:23:41.5130241Z + uname -m
2025-05-07T20:23:41.5130359Z
2025-05-07T20:23:41.5130441Z x86_64
2025-05-07T20:23:41.5130551Z
2025-05-07T20:23:41.5130640Z + cat /proc/version
2025-05-07T20:23:41.5130784Z
2025-05-07T20:23:41.5131337Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025
2025-05-07T20:23:41.5131982Z
2025-05-07T20:23:41.5132072Z + cat /etc/os-release
2025-05-07T20:23:41.5132220Z
2025-05-07T20:23:41.5132320Z NAME="Amazon Linux"
2025-05-07T20:23:41.5132538Z VERSION="2023"
2025-05-07T20:23:41.5132748Z ID="amzn"
2025-05-07T20:23:41.5132944Z ID_LIKE="fedora"
2025-05-07T20:23:41.5133150Z VERSION_ID="2023"
2025-05-07T20:23:41.5133481Z PLATFORM_ID="platform:al2023"
2025-05-07T20:23:41.5133771Z PRETTY_NAME="Amazon Linux 2023.6.20250317"
2025-05-07T20:23:41.5134060Z ANSI_COLOR="0;33"
2025-05-07T20:23:41.5134316Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
2025-05-07T20:23:41.5134721Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
2025-05-07T20:23:41.5135173Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
2025-05-07T20:23:41.5135595Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
2025-05-07T20:23:41.5136051Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
2025-05-07T20:23:41.5136432Z VENDOR_NAME="AWS"
2025-05-07T20:23:41.5136675Z VENDOR_URL="https://aws.amazon.com/"
2025-05-07T20:23:41.5136976Z SUPPORT_END="2029-06-30"
2025-05-07T20:23:41.5137133Z
2025-05-07T20:23:41.5137345Z ################################################################################
2025-05-07T20:23:41.5137662Z # Print EC2 Instance Info
2025-05-07T20:23:41.5137909Z #
2025-05-07T20:23:41.5138130Z # [2025-05-07T20:23:41.507Z] + print_ec2_info
2025-05-07T20:23:41.5138461Z ################################################################################
2025-05-07T20:23:41.5138680Z
2025-05-07T20:23:41.5197615Z ami-id: ami-071226ecf16aa7d96
2025-05-07T20:23:41.5309944Z instance-id: i-04dd41b83603cbddd
2025-05-07T20:23:41.5427739Z instance-type: g5.4xlarge
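The ami-id, instance-id, and instance-type lines above are presumably read from the EC2 instance metadata service. A minimal IMDSv2 sketch that would produce the same three lines (the endpoint and paths are standard EC2; the exact helper behind print_ec2_info is not shown in this log):

  TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  for key in ami-id instance-id instance-type; do
    echo "${key}: $(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
      "http://169.254.169.254/latest/meta-data/${key}")"
  done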
2025-05-07T20:23:41.5469226Z ##[group]Run . $PRELUDE; print_gpu_info
2025-05-07T20:23:41.5469660Z . $PRELUDE; print_gpu_info
2025-05-07T20:23:41.5478630Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:41.5479021Z env:
2025-05-07T20:23:41.5479252Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:41.5479589Z   BUILD_ENV: build_binary
2025-05-07T20:23:41.5479855Z   BUILD_TARGET: genai
2025-05-07T20:23:41.5480096Z   BUILD_VARIANT: cuda
2025-05-07T20:23:41.5480430Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:41.5480697Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:41.5481043Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:41.5481419Z ##[endgroup]
2025-05-07T20:23:41.8822761Z ################################################################################
2025-05-07T20:23:41.8823180Z [INFO] Printing general display info ...
2025-05-07T20:23:41.8837076Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:41.9747510Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:41.9757587Z /usr/bin/sudo
2025-05-07T20:23:41.9767987Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:41.9778764Z /usr/bin/yum
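The /usr/bin/sudo, "which: no apt-get", and /usr/bin/yum lines above are a probe for the available package manager before any system installs are attempted. A sketch of that dispatch -- the function name is assumed; only the apt-get/yum fallback order is evidenced by this log:

  install_system_packages () {
    if which apt-get > /dev/null 2>&1; then
      sudo apt-get update && sudo apt-get install -y "$@"
    elif which yum > /dev/null 2>&1; then
      sudo yum install -y "$@"
    else
      echo "[CHECK] No supported package manager found" >&2
      return 1
    fi
  }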
2025-05-07T20:23:41.9780435Z [INSTALL] Updating system repositories ...
2025-05-07T20:23:41.9800446Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y
2025-05-07T20:23:42.4526689Z Last metadata expiration check: 0:00:06 ago on Wed May  7 20:23:36 2025.
2025-05-07T20:23:42.5321833Z ================================================================================
2025-05-07T20:23:42.5322188Z WARNING:
2025-05-07T20:23:42.5322475Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:42.5322722Z
2025-05-07T20:23:42.5322819Z   Available Versions:
2025-05-07T20:23:42.5322978Z
2025-05-07T20:23:42.5323079Z   Version 2023.7.20250331:
2025-05-07T20:23:42.5323397Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:42.5323664Z
2025-05-07T20:23:42.5323803Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:42.5324021Z
2025-05-07T20:23:42.5324118Z     Release notes:
2025-05-07T20:23:42.5324538Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:42.5324928Z
2025-05-07T20:23:42.5325019Z   Version 2023.7.20250414:
2025-05-07T20:23:42.5325338Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:42.5325596Z
2025-05-07T20:23:42.5325724Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:42.5325943Z
2025-05-07T20:23:42.5326029Z     Release notes:
2025-05-07T20:23:42.5326437Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:42.5327047Z
2025-05-07T20:23:42.5327147Z   Version 2023.7.20250428:
2025-05-07T20:23:42.5327466Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:42.5327725Z
2025-05-07T20:23:42.5327844Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:42.5328070Z
2025-05-07T20:23:42.5328157Z     Release notes:
2025-05-07T20:23:42.5328561Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:42.5328934Z
2025-05-07T20:23:42.5329055Z ================================================================================
2025-05-07T20:23:42.6500385Z Dependencies resolved.
2025-05-07T20:23:42.6784929Z ================================================================================
2025-05-07T20:23:42.6786197Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:42.6786820Z ================================================================================
2025-05-07T20:23:42.6787298Z Upgrading:
2025-05-07T20:23:42.6787875Z  nvidia-container-toolkit       x86_64  1.17.6-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:42.6788842Z  nvidia-container-toolkit-base  x86_64  1.17.6-1  nvidia-container-toolkit 5.7 M
2025-05-07T20:23:42.6789434Z
2025-05-07T20:23:42.6790027Z Transaction Summary
2025-05-07T20:23:42.6790463Z ================================================================================
2025-05-07T20:23:42.6790965Z Upgrade  2 Packages
2025-05-07T20:23:42.6791192Z
2025-05-07T20:23:42.6791368Z Total download size: 6.9 M
2025-05-07T20:23:42.6791752Z Downloading Packages:
2025-05-07T20:23:42.7184939Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64  32 MB/s | 1.2 MB  00:00
2025-05-07T20:23:42.7622861Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x  69 MB/s | 5.7 MB  00:00
2025-05-07T20:23:42.7633347Z --------------------------------------------------------------------------------
2025-05-07T20:23:42.7634171Z Total                                            82 MB/s | 6.9 MB  00:00
2025-05-07T20:23:42.7636738Z Running transaction check
2025-05-07T20:23:42.7740665Z Transaction check succeeded.
2025-05-07T20:23:42.7741105Z Running transaction test
2025-05-07T20:23:42.8034383Z Transaction test succeeded.
2025-05-07T20:23:42.8037144Z Running transaction
2025-05-07T20:23:43.3652786Z   Preparing        :                                                       1/1
2025-05-07T20:23:43.4710223Z   Upgrading        : nvidia-container-toolkit-base-1.17.6-1.x86_64         1/4
2025-05-07T20:23:43.4730248Z   Upgrading        : nvidia-container-toolkit-1.17.6-1.x86_64              2/4
2025-05-07T20:23:43.4937447Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64              2/4
2025-05-07T20:23:43.4938221Z   Cleanup          : nvidia-container-toolkit-1.16.2-1.x86_64              3/4
2025-05-07T20:23:43.5043703Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              3/4
2025-05-07T20:23:43.5064903Z   Cleanup          : nvidia-container-toolkit-base-1.16.2-1.x86_64         4/4
2025-05-07T20:23:43.6524450Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64              4/4
2025-05-07T20:23:43.6525050Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64              1/4
2025-05-07T20:23:43.6525623Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64              2/4
2025-05-07T20:23:43.6526170Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64         3/4
2025-05-07T20:23:43.8009343Z ================================================================================
2025-05-07T20:23:43.8009727Z WARNING:
2025-05-07T20:23:43.8009982Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:43.8010229Z
2025-05-07T20:23:43.8010326Z   Available Versions:
2025-05-07T20:23:43.8010486Z
2025-05-07T20:23:43.8010579Z   Version 2023.7.20250331:
2025-05-07T20:23:43.8010905Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:43.8011431Z
2025-05-07T20:23:43.8011559Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:43.8011786Z
2025-05-07T20:23:43.8011876Z     Release notes:
2025-05-07T20:23:43.8012301Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:43.8012683Z
2025-05-07T20:23:43.8012796Z   Version 2023.7.20250414:
2025-05-07T20:23:43.8013116Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:43.8013643Z
2025-05-07T20:23:43.8013767Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:43.8013985Z
2025-05-07T20:23:43.8014081Z     Release notes:
2025-05-07T20:23:43.8014488Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:43.8014872Z
2025-05-07T20:23:43.8014966Z   Version 2023.7.20250428:
2025-05-07T20:23:43.8015290Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:43.8015549Z
2025-05-07T20:23:43.8015681Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:43.8015897Z
2025-05-07T20:23:43.8015987Z     Release notes:
2025-05-07T20:23:43.8016394Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:43.8016769Z
2025-05-07T20:23:43.8017113Z ================================================================================
2025-05-07T20:23:43.8578162Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64         4/4
2025-05-07T20:23:43.8578848Z
2025-05-07T20:23:43.8579019Z Upgraded:
2025-05-07T20:23:43.8579729Z   nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:43.8580894Z   nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:43.8581524Z
2025-05-07T20:23:43.8581628Z Complete!
2025-05-07T20:23:43.9016676Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:43.9038918Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:44.3542686Z Last metadata expiration check: 0:00:08 ago on Wed May  7 20:23:36 2025.
2025-05-07T20:23:44.3780912Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:44.3786800Z Package lshw-B.02.19.2-7.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:44.4191267Z Dependencies resolved.
2025-05-07T20:23:44.4373799Z Nothing to do.
2025-05-07T20:23:44.4374238Z Complete!
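Note that yum install -y succeeds without doing anything when the requested packages are already present ("Nothing to do. / Complete!"), so the install step above is naturally idempotent. An explicit pre-check would look like this -- an alternative sketch, not what this job actually runs:

  for pkg in hostname lshw; do
    # rpm -q exits non-zero when the package is absent
    rpm -q "${pkg}" > /dev/null 2>&1 || sudo yum install -y "${pkg}"
  done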
2025-05-07T20:23:44.4774365Z + hostname
2025-05-07T20:23:44.4774500Z
2025-05-07T20:23:44.4788899Z ip-10-0-8-106.ec2.internal
2025-05-07T20:23:44.4790556Z
2025-05-07T20:23:44.4790797Z + sudo lshw -C display
2025-05-07T20:23:44.4790969Z
2025-05-07T20:23:44.7280479Z   *-display:0 UNCLAIMED
2025-05-07T20:23:44.7280787Z        description: VGA compatible controller
2025-05-07T20:23:44.7281119Z        product: Amazon.com, Inc.
2025-05-07T20:23:44.7281409Z        vendor: Amazon.com, Inc.
2025-05-07T20:23:44.7281681Z        physical id: 3
2025-05-07T20:23:44.7282059Z        bus info: pci@0000:00:03.0
2025-05-07T20:23:44.7282621Z        version: 00
2025-05-07T20:23:44.7283074Z        width: 32 bits
2025-05-07T20:23:44.7283517Z        clock: 33MHz
2025-05-07T20:23:44.7284013Z        capabilities: vga_controller bus_master
2025-05-07T20:23:44.7284656Z        configuration: latency=0
2025-05-07T20:23:44.7285337Z        resources: memory:c1000000-c13fffff memory:c0000-dffff
2025-05-07T20:23:44.7286010Z   *-display:1
2025-05-07T20:23:44.7286460Z        description: 3D controller
2025-05-07T20:23:44.7287023Z        product: GA102GL [A10G]
2025-05-07T20:23:44.7287558Z        vendor: NVIDIA Corporation
2025-05-07T20:23:44.7288102Z        physical id: 1e
2025-05-07T20:23:44.7288582Z        bus info: pci@0000:00:1e.0
2025-05-07T20:23:44.7289098Z        version: a1
2025-05-07T20:23:44.7289517Z        width: 64 bits
2025-05-07T20:23:44.7289961Z        clock: 33MHz
2025-05-07T20:23:44.7290547Z        capabilities: pm pciexpress msix bus_master cap_list
2025-05-07T20:23:44.7291300Z        configuration: driver=nvidia latency=0
2025-05-07T20:23:44.7292271Z        resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff
2025-05-07T20:23:44.7320925Z
2025-05-07T20:23:44.7321290Z ################################################################################
2025-05-07T20:23:44.7321816Z [INFO] Printing NVIDIA GPU info ...
2025-05-07T20:23:44.7450337Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:44.7636600Z Wed May  7 20:23:44 2025
2025-05-07T20:23:44.7636982Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:44.7637485Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:44.7637995Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:44.7638504Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:44.7639076Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:44.7639514Z |                                         |                        |               MIG M. |
2025-05-07T20:23:44.7640281Z |=========================================+========================+======================|
2025-05-07T20:23:44.7770093Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:44.7770559Z |  0%   30C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:44.7770955Z |                                         |                        |                  N/A |
2025-05-07T20:23:44.7771375Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:44.7774791Z
2025-05-07T20:23:44.7775204Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:44.7775653Z | Processes:                                                                              |
2025-05-07T20:23:44.7776113Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:44.7776555Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:44.7776919Z |=========================================================================================|
2025-05-07T20:23:44.7780032Z |  No running processes found                                                             |
2025-05-07T20:23:44.7780520Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:45.0406990Z ################################################################################
2025-05-07T20:23:45.0407481Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:45.0547830Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:45.0548732Z [CHECK] rocminfo not found
2025-05-07T20:23:45.0558009Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:45.0558860Z [CHECK] rocm-smi not found
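ENFORCE_CUDA_DEVICE=1 is set in this job's environment, so the prelude is expected to fail fast if no usable NVIDIA device is visible after the checks above; the missing rocminfo/rocm-smi just confirms this runner is not a ROCm host. A hedged sketch of such a gate (the variable is real, the gating logic shown here is an assumption):

  if [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
    # nvidia-smi exits non-zero if the driver or device is unavailable
    if ! nvidia-smi --list-gpus > /dev/null 2>&1; then
      echo "[CHECK] ENFORCE_CUDA_DEVICE is set, but no NVIDIA GPU is usable" >&2
      exit 1
    fi
  fi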
2025-05-07T20:24:00.7085612Z installation finished.
2025-05-07T20:24:00.7093668Z + rm -f miniconda.sh
2025-05-07T20:24:00.7411845Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:24:00.7412209Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:01.1156219Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:01.1156609Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:01.1156993Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:01.1157386Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:01.1157763Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:01.1158437Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:01.1158896Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:01.1159358Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:01.1159833Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:01.1160490Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:01.1161032Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:01.1161413Z no change     /home/ec2-user/.bashrc
2025-05-07T20:24:01.1161695Z No action taken.
2025-05-07T20:24:01.1811145Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:02.0229257Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:02.0252824Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:15.5434259Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:17.1348581Z Solving environment: done
2025-05-07T20:24:17.2317819Z ## Package Plan ##
2025-05-07T20:24:17.2318167Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:17.2318523Z   added / updated specs:
2025-05-07T20:24:17.2318791Z     - conda-libmamba-solver
2025-05-07T20:24:17.2319066Z     - libarchive
2025-05-07T20:24:17.2319287Z     - libmamba
2025-05-07T20:24:17.2319502Z     - libmambapy
2025-05-07T20:24:17.2319764Z The following packages will be downloaded:
2025-05-07T20:24:17.2320112Z     package                      |            build
2025-05-07T20:24:17.2320536Z     -----------------------------|-----------------
2025-05-07T20:24:17.2320957Z     ca-certificates-2025.4.26    |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:24:17.2321486Z     certifi-2025.4.26            |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:24:17.2321966Z     conda-25.3.1                 |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:24:17.2322463Z     conda-libmamba-solver-25.4.0 |     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:24:17.2322933Z     ------------------------------------------------------------
2025-05-07T20:24:17.2323294Z                                            Total:         1.4 MB
2025-05-07T20:24:17.2323638Z The following packages will be UPDATED:
2025-05-07T20:24:17.2328309Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:17.2329108Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:17.2329731Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:17.2330385Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:17.2331216Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:17.2332130Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:17.3015094Z certifi-2025.4.26    | 154 KB  | ########## | 100%
2025-05-07T20:24:17.3096210Z conda-libmamba-solve | 41 KB   | ########## | 100%
2025-05-07T20:24:17.3274459Z ca-certificates-2025 | 149 KB  | ########## | 100%
2025-05-07T20:24:17.4416732Z conda-25.3.1         | 1.1 MB  | ########## | 100%
2025-05-07T20:24:17.4418411Z done
2025-05-07T20:24:17.5420830Z Preparing transaction: done
2025-05-07T20:24:17.6426194Z Verifying transaction: done
2025-05-07T20:24:18.9445029Z Executing transaction: done
2025-05-07T20:24:20.7943950Z [SETUP] Updating Miniconda base packages ...
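The [EXEC] [ATTEMPT 0/3] prefix that precedes each command in this log comes from a retry wrapper in .github/scripts/setup_env.bash that re-runs flaky, network-bound commands. A minimal bash sketch of such a wrapper (the helper name, retry count, and back-off are illustrative, not the actual implementation):

    # Illustrative retry wrapper; the real helper lives in setup_env.bash.
    exec_with_retries () {
      local max_retries=3
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then
          return 0   # command succeeded; stop retrying
        fi
        sleep 2      # brief pause before the next attempt
      done
      echo "[EXEC] Command failed after ${max_retries} retries: $*"
      return 1
    }
    # example: the network probe used throughout this log
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null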
2025-05-07T20:24:20.7967503Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:21.6219599Z Channels:
2025-05-07T20:24:21.6219858Z  - defaults
2025-05-07T20:24:21.6220079Z Platform: linux-64
2025-05-07T20:24:22.8596027Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:22.9810308Z Solving environment: done
2025-05-07T20:24:22.9810913Z Channels:
2025-05-07T20:24:22.9811339Z  - defaults
2025-05-07T20:24:22.9811339Z Platform: linux-64
2025-05-07T20:24:23.2772049Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:23.4929011Z Solving environment: done
2025-05-07T20:24:23.6392754Z ## Package Plan ##
2025-05-07T20:24:23.6393071Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:23.6393491Z   added / updated specs:
2025-05-07T20:24:23.6393744Z     - conda
2025-05-07T20:24:23.6393993Z The following packages will be downloaded:
2025-05-07T20:24:23.6394357Z     package        |            build
2025-05-07T20:24:23.6394689Z     ---------------|-----------------
2025-05-07T20:24:23.6395281Z     pip-25.1       |     pyhc872135_2         1.3 MB
2025-05-07T20:24:23.6395690Z     tzdata-2025b   |       h04d1e81_0         116 KB
2025-05-07T20:24:23.6396076Z     ------------------------------------------------------------
2025-05-07T20:24:23.6396427Z                                            Total:         1.4 MB
2025-05-07T20:24:23.6396760Z The following packages will be UPDATED:
2025-05-07T20:24:23.6397284Z   pip     pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:23.6397807Z   tzdata                      2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:23.6398372Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:23.7029659Z pip-25.1             | 1.3 MB  | ########## | 100%
2025-05-07T20:24:23.8675708Z tzdata-2025b         | 116 KB  | ########## | 100%
2025-05-07T20:24:23.9004903Z done
2025-05-07T20:24:24.0007404Z Preparing transaction: done
2025-05-07T20:24:24.1013750Z Verifying transaction: done
2025-05-07T20:24:26.2040326Z Executing transaction: done
2025-05-07T20:24:26.8403397Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:26.8408792Z + conda clean --packages --tarball -y
2025-05-07T20:24:27.8609853Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:27.8610217Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:27.9251666Z + conda clean --all -y
2025-05-07T20:24:28.4754540Z There are no unused tarball(s) to remove.
2025-05-07T20:24:28.4754906Z Will remove 1 index cache(s).
2025-05-07T20:24:28.4755205Z There are no unused package(s) to remove.
2025-05-07T20:24:28.4755525Z There are no tempfile(s) to remove.
2025-05-07T20:24:28.4755863Z There are no logfile(s) to remove.
2025-05-07T20:24:28.5398615Z + conda info
2025-05-07T20:24:29.2957784Z      active environment : base
2025-05-07T20:24:29.2958316Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:24:29.2958810Z             shell level : 1
2025-05-07T20:24:29.2959231Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:24:29.2959803Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:29.2960468Z           conda version : 25.3.1
2025-05-07T20:24:29.2960891Z     conda-build version : not installed
2025-05-07T20:24:29.2961346Z          python version : 3.13.2.final.0
2025-05-07T20:24:29.2961783Z                  solver : libmamba (default)
2025-05-07T20:24:29.2962272Z        virtual packages : __archspec=1=zen2
2025-05-07T20:24:29.2962737Z                           __conda=25.3.1=0
2025-05-07T20:24:29.2963165Z                           __cuda=12.8=0
2025-05-07T20:24:29.2963590Z                           __glibc=2.34=0
2025-05-07T20:24:29.2964018Z                           __linux=6.1.130=0
2025-05-07T20:24:29.2964450Z                           __unix=0=0
2025-05-07T20:24:29.2965348Z        base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:29.2965979Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:29.2966502Z   conda av metadata url : None
2025-05-07T20:24:29.2967056Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:29.2967706Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:29.2968290Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:29.2968850Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:29.2969413Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:29.2969923Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:24:29.2970672Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:29.2971189Z                           /home/ec2-user/.conda/envs
2025-05-07T20:24:29.2971677Z                platform : linux-64
2025-05-07T20:24:29.2972928Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:29.2974188Z                 UID:GID : 1000:1000
2025-05-07T20:24:29.2974610Z              netrc file : None
2025-05-07T20:24:29.2975001Z            offline mode : False
2025-05-07T20:24:29.3629953Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:29.3630725Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_0d771e95-4678-43f1-82ee-37ea75e113eb ...
2025-05-07T20:24:29.3631572Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
2025-05-07T20:24:29.3703770Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:29.3704287Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:29.3722388Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:29.3722749Z env:
2025-05-07T20:24:29.3722979Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:29.3723293Z   BUILD_ENV: build_binary
2025-05-07T20:24:29.3723549Z   BUILD_TARGET: genai
2025-05-07T20:24:29.3723785Z   BUILD_VARIANT: cuda
2025-05-07T20:24:29.3724023Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:29.3724286Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:29.3724596Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:29.3724933Z ##[endgroup]
2025-05-07T20:24:29.7066087Z ################################################################################
2025-05-07T20:24:29.7066817Z # Create Conda Environment
2025-05-07T20:24:29.7067353Z #
2025-05-07T20:24:29.7080674Z # [2025-05-07T20:24:29.707Z] + create_conda_environment build_binary 3.13
2025-05-07T20:24:29.7081115Z ################################################################################
2025-05-07T20:24:29.7095540Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:29.7985293Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:29.7985681Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:29.7986017Z + conda info --envs
2025-05-07T20:24:30.5451315Z # conda environments:
2025-05-07T20:24:30.5451649Z #
2025-05-07T20:24:30.5451946Z base                 /home/ec2-user/miniconda
2025-05-07T20:24:30.6112857Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:32.2509697Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:32.2534747Z [SETUP] Creating new Conda environment (Python 3.13) ...
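The create_conda_environment step below boils down to deleting any stale prefix and recreating the environment pinned to the requested Python. A condensed sketch of the equivalent commands (grounded in the + lines printed in this log; the real function in setup_env.bash does additional validation):

    env_name=build_binary     # $BUILD_ENV for this job
    python_version=3.13
    # remove any stale prefix so the create starts from a clean slate
    rm -rf "$HOME/miniconda/envs/${env_name}"
    conda create -y -n "${env_name}" "python=${python_version}"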
2025-05-07T20:24:32.2557145Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13
2025-05-07T20:24:33.0099177Z Channels:
2025-05-07T20:24:33.0099435Z  - defaults
2025-05-07T20:24:33.0099651Z Platform: linux-64
2025-05-07T20:24:34.5753206Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:34.6760022Z Solving environment: done
2025-05-07T20:24:34.7050709Z ## Package Plan ##
2025-05-07T20:24:34.7051457Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:34.7052344Z   added / updated specs:
2025-05-07T20:24:34.7052855Z     - python=3.13
2025-05-07T20:24:34.7053405Z The following packages will be downloaded:
2025-05-07T20:24:34.7054490Z     package                   |            build
2025-05-07T20:24:34.7055159Z     --------------------------|-----------------
2025-05-07T20:24:34.7055923Z     _libgcc_mutex-0.1         |             main           3 KB
2025-05-07T20:24:34.7056488Z     _openmp_mutex-5.1         |            1_gnu          21 KB
2025-05-07T20:24:34.7056940Z     ca-certificates-2025.2.25 |       h06a4308_0         129 KB
2025-05-07T20:24:34.7057370Z     python_abi-3.13           |          0_cp313           6 KB
2025-05-07T20:24:34.7057767Z     ------------------------------------------------------------
2025-05-07T20:24:34.7058127Z                                            Total:         159 KB
2025-05-07T20:24:34.7058485Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:34.7058937Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:34.7059413Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:34.7060067Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:34.7060578Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:34.7061095Z   expat              pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:34.7061578Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:34.7062071Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:34.7062523Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:34.7062992Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:34.7063458Z   libmpdec           pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0
2025-05-07T20:24:34.7063955Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:34.7064435Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:34.7064897Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:34.7065346Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:34.7065777Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:34.7066227Z   python             pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313
2025-05-07T20:24:34.7066705Z   python_abi         pkgs/main/linux-64::python_abi-3.13-0_cp313
2025-05-07T20:24:34.7067162Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:34.7067658Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0
2025-05-07T20:24:34.7068154Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:34.7068569Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:34.7068980Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:34.7069420Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0
2025-05-07T20:24:34.7069856Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:34.7070255Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:34.7070676Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:34.7413193Z _openmp_mutex-5.1    | 21 KB   | ########## | 100%
2025-05-07T20:24:34.7455183Z _libgcc_mutex-0.1    | 3 KB    | ########## | 100%
2025-05-07T20:24:34.7560088Z python_abi-3.13      | 6 KB    | ########## | 100%
2025-05-07T20:24:34.7739795Z ca-certificates-2025 | 129 KB  | ########## | 100%
2025-05-07T20:24:34.7767450Z done
2025-05-07T20:24:34.9822079Z Preparing transaction: done
2025-05-07T20:24:36.4381425Z Verifying transaction: done
2025-05-07T20:24:38.7538661Z Executing transaction: done
2025-05-07T20:24:38.8037946Z #
2025-05-07T20:24:38.8038567Z # To activate this environment, use
2025-05-07T20:24:38.8039149Z #
2025-05-07T20:24:38.8039559Z #     $ conda activate build_binary
2025-05-07T20:24:38.8040092Z #
2025-05-07T20:24:38.8040824Z # To deactivate an active environment, use
2025-05-07T20:24:38.8041440Z #
2025-05-07T20:24:38.8041829Z #     $ conda deactivate
2025-05-07T20:24:38.9139629Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:38.9163411Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:41.7915220Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:41.7915872Z Collecting pip
2025-05-07T20:24:41.7916203Z   Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:41.7916638Z Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:41.7917002Z Installing collected packages: pip
2025-05-07T20:24:41.7917313Z   Attempting uninstall: pip
2025-05-07T20:24:41.7917608Z     Found existing installation: pip 25.1
2025-05-07T20:24:41.7917928Z     Uninstalling pip-25.1:
2025-05-07T20:24:41.7918219Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:41.7918575Z Successfully installed pip-25.1.1
2025-05-07T20:24:41.8547316Z [SETUP] Upgrading pyOpenSSL ...
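Two details are worth noting here. The pip upgrade above uses conda run -n build_binary, which executes a command inside the named environment without activating it in the current shell. The pyOpenSSL install that follows carries a version constraint; in an interactive shell the spec must be quoted so that > is not parsed as an output redirection (a hedged reconstruction, not the exact invocation in setup_env.bash):

    # run a tool inside the environment without activating it
    conda run -n build_binary pip install --upgrade pip
    # version-constrained install restricted to conda-forge; note the quoting
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"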
2025-05-07T20:24:41.8571473Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:42.7105253Z Channels:
2025-05-07T20:24:42.7105588Z  - conda-forge
2025-05-07T20:24:42.7105911Z Platform: linux-64
2025-05-07T20:24:53.2558856Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:54.9437531Z Solving environment: done
2025-05-07T20:24:55.0072390Z ## Package Plan ##
2025-05-07T20:24:55.0072908Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:55.0073333Z   added / updated specs:
2025-05-07T20:24:55.0073618Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:55.0074520Z The following packages will be downloaded:
2025-05-07T20:24:55.0074966Z     package                  |            build
2025-05-07T20:24:55.0075312Z     -------------------------|-----------------
2025-05-07T20:24:55.0075701Z     cffi-1.17.1              |  py313hfab6e84_0         289 KB  conda-forge
2025-05-07T20:24:55.0076164Z     cryptography-44.0.3      |  py313h6556f6e_0         1.5 MB  conda-forge
2025-05-07T20:24:55.0076630Z     libgcc-15.1.0            |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:55.0077068Z     libgcc-ng-15.1.0         |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:55.0077511Z     libgomp-15.1.0           |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:55.0077938Z     openssl-3.5.0            |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:55.0078383Z     pycparser-2.22           |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:55.0079071Z     pyopenssl-25.0.0         |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:55.0079717Z     typing-extensions-4.13.2 |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:55.0080349Z     typing_extensions-4.13.2 |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:55.0080793Z     ------------------------------------------------------------
2025-05-07T20:24:55.0081149Z                                            Total:         6.4 MB
2025-05-07T20:24:55.0081502Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:55.0081948Z   cffi               conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0
2025-05-07T20:24:55.0082512Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0
2025-05-07T20:24:55.0083033Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:55.0083744Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:55.0084437Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:55.0085231Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:55.0086150Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:55.0086879Z The following packages will be UPDATED:
2025-05-07T20:24:55.0087834Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:55.0089069Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:55.0089921Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:55.0090589Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:55.0091154Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:55.1073983Z cffi-1.17.1          | 289 KB  | ########## | 100%
2025-05-07T20:24:55.1251100Z pyopenssl-25.0.0     | 120 KB  | ########## | 100%
2025-05-07T20:24:55.1615622Z libgcc-15.1.0        | 810 KB  | ########## | 100%
2025-05-07T20:24:55.1717461Z libgomp-15.1.0       | 442 KB  | ########## | 100%
2025-05-07T20:24:55.1864208Z pycparser-2.22       | 108 KB  | ########## | 100%
2025-05-07T20:24:55.1992325Z cryptography-44.0.3  | 1.5 MB  | ########## | 100%
2025-05-07T20:24:55.2035816Z openssl-3.5.0        | 3.0 MB  | ########## | 100%
2025-05-07T20:24:55.2144686Z typing-extensions-4. | 88 KB   | ########## | 100%
2025-05-07T20:24:55.2194635Z typing_extensions-4. | 51 KB   | ########## | 100%
2025-05-07T20:24:55.2247405Z libgcc-ng-15.1.0     | 34 KB   | ########## | 100%
2025-05-07T20:24:55.4710344Z done
2025-05-07T20:24:55.5714115Z Preparing transaction: done
2025-05-07T20:24:55.6719681Z Verifying transaction: done
2025-05-07T20:24:57.1744943Z Executing transaction: done
2025-05-07T20:24:57.3509498Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:59.0707323Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:59.0723346Z [SETUP] Installing libxcrypt ...
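The libxcrypt step below presumably works around newer glibc and Python builds no longer shipping crypt.h: the header is installed from conda-forge and then copied into the environment's Python include directory so that native extensions which still #include <crypt.h> can compile. In command form (paths taken from the cp line in this log; the python3.13 include path is specific to this job):

    prefix="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    # expose crypt.h on the Python include path used by extension builds
    cp "${prefix}/include/crypt.h" "${prefix}/include/python3.13/crypt.h"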
2025-05-07T20:24:59.0746518Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:59.9350800Z Channels:
2025-05-07T20:24:59.9351279Z  - conda-forge
2025-05-07T20:24:59.9351746Z Platform: linux-64
2025-05-07T20:25:03.1965649Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.5618640Z Solving environment: done
2025-05-07T20:25:03.6241763Z ## Package Plan ##
2025-05-07T20:25:03.6242483Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.6243811Z   added / updated specs:
2025-05-07T20:25:03.6244319Z     - libxcrypt
2025-05-07T20:25:03.6244729Z The following packages will be downloaded:
2025-05-07T20:25:03.6245126Z     package            |            build
2025-05-07T20:25:03.6245457Z     -------------------|-----------------
2025-05-07T20:25:03.6245859Z     libxcrypt-4.4.36   |       hd590300_1          98 KB  conda-forge
2025-05-07T20:25:03.6246284Z     ------------------------------------------------------------
2025-05-07T20:25:03.6246642Z                                            Total:          98 KB
2025-05-07T20:25:03.6246996Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.6247466Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:03.6247948Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:03.8269081Z libxcrypt-4.4.36     | 98 KB   | ########## | 100%
2025-05-07T20:25:03.8272307Z done
2025-05-07T20:25:03.9277062Z Preparing transaction: done
2025-05-07T20:25:04.0281444Z Verifying transaction: done
2025-05-07T20:25:04.1287204Z Executing transaction: done
2025-05-07T20:25:07.5591715Z [SETUP] Copying over ...
2025-05-07T20:25:07.5592462Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h
2025-05-07T20:25:09.2007742Z [SETUP] Installed Python version: Python 3.13.2
2025-05-07T20:25:09.2008212Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:25:09.2044575Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:09.2045047Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:09.2057565Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:09.2057935Z env:
2025-05-07T20:25:09.2058172Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:09.2058482Z   BUILD_ENV: build_binary
2025-05-07T20:25:09.2058739Z   BUILD_TARGET: genai
2025-05-07T20:25:09.2058980Z   BUILD_VARIANT: cuda
2025-05-07T20:25:09.2059225Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:09.2059498Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:09.2059815Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:09.2060168Z ##[endgroup]
2025-05-07T20:25:09.5431703Z ################################################################################
2025-05-07T20:25:09.5432431Z # Install C/C++ Compilers
2025-05-07T20:25:09.5432684Z #
2025-05-07T20:25:09.5447234Z # [2025-05-07T20:25:09.544Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:09.5447887Z ################################################################################
2025-05-07T20:25:09.5462734Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:09.6357531Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:09.6368447Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:09.6390321Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:10.5008802Z Channels:
2025-05-07T20:25:10.5009092Z  - conda-forge
2025-05-07T20:25:10.5009335Z Platform: linux-64
2025-05-07T20:25:13.7872784Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:14.1532904Z Solving environment: done
2025-05-07T20:25:14.2157447Z ## Package Plan ##
2025-05-07T20:25:14.2157982Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:14.2158461Z   added / updated specs:
2025-05-07T20:25:14.2158878Z     - sysroot_linux-64=2.17
2025-05-07T20:25:14.2159346Z The following packages will be downloaded:
2025-05-07T20:25:14.2159896Z     package                        |            build
2025-05-07T20:25:14.2160501Z     -------------------------------|-----------------
2025-05-07T20:25:14.2161078Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:25:14.2161589Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:25:14.2162027Z     ------------------------------------------------------------
2025-05-07T20:25:14.2162402Z                                            Total:        15.4 MB
2025-05-07T20:25:14.2162759Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:14.2163309Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:14.2163899Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:14.2164389Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:14.5514189Z kernel-headers_linux | 921 KB  | ########## | 100%
2025-05-07T20:25:14.6660335Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:15.2408540Z done
2025-05-07T20:25:15.3411655Z Preparing transaction: done
2025-05-07T20:25:15.5418536Z Verifying transaction: done
2025-05-07T20:25:15.7501242Z Executing transaction: done
2025-05-07T20:25:15.9042064Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:15.9042394Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:17.5857350Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:17.5870127Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:17.5892481Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:18.4776466Z Channels:
2025-05-07T20:25:18.4776732Z  - conda-forge
2025-05-07T20:25:18.4776969Z Platform: linux-64
2025-05-07T20:25:21.7254732Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:22.6964996Z Solving environment: done
2025-05-07T20:25:22.7607692Z ## Package Plan ##
2025-05-07T20:25:22.7608275Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:22.7608852Z   added / updated specs:
2025-05-07T20:25:22.7609197Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:22.7609603Z The following packages will be downloaded:
2025-05-07T20:25:22.7609970Z     package                         |            build
2025-05-07T20:25:22.7610299Z     --------------------------------|-----------------
2025-05-07T20:25:22.7610717Z     binutils_impl_linux-64-2.40     |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:25:22.7611217Z     binutils_linux-64-2.40          |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:25:22.7611695Z     gcc_impl_linux-64-11.4.0        |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:25:22.7612157Z     gcc_linux-64-11.4.0             |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:25:22.7612617Z     gxx_impl_linux-64-11.4.0        |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:25:22.7613084Z     gxx_linux-64-11.4.0             |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:25:22.7613782Z     ld_impl_linux-64-2.40           |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:25:22.7614275Z     libgcc-devel_linux-64-11.4.0    |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:25:22.7614775Z     libsanitizer-11.4.0             |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:25:22.7615229Z     libstdcxx-15.1.0                |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:25:22.7615722Z     libstdcxx-devel_linux-64-11.4.0 |     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:25:22.7616218Z     libstdcxx-ng-15.1.0             |       h4852527_2          34 KB  conda-forge
2025-05-07T20:25:22.7616637Z     ------------------------------------------------------------
2025-05-07T20:25:22.7616990Z                                            Total:        91.6 MB
2025-05-07T20:25:22.7617350Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:22.7617873Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:22.7618886Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:22.7619447Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:22.7619978Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:22.7620510Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:22.7621036Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:22.7621583Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:22.7622166Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:22.7622686Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:22.7623409Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:22.7623904Z The following packages will be UPDATED:
2025-05-07T20:25:22.7624455Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:22.7625202Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:22.7625788Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:23.1056771Z binutils_impl_linux- | 6.0 MB  | ########## | 100%
2025-05-07T20:25:23.3069968Z libgcc-devel_linux-6 | 2.3 MB  | ########## | 100%
2025-05-07T20:25:23.4361335Z libstdcxx-ng-15.1.0  | 34 KB   | ########## | 100%
2025-05-07T20:25:23.4628988Z ld_impl_linux-64-2.4 | 691 KB  | ########## | 100%
2025-05-07T20:25:23.4873070Z gcc_linux-64-11.4.0  | 31 KB   | ########## | 100%
2025-05-07T20:25:23.5242267Z gxx_linux-64-11.4.0  | 29 KB   | ########## | 100%
2025-05-07T20:25:23.5592977Z binutils_linux-64-2. | 28 KB   | ########## | 100%
2025-05-07T20:25:23.6631618Z libstdcxx-15.1.0     | 3.7 MB  | ########## | 100%
2025-05-07T20:25:23.8400828Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:24.0356977Z libsanitizer-11.4.0  | 3.5 MB  | ########## | 100%
2025-05-07T20:25:24.4049656Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:25.2124497Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
2025-05-07T20:25:25.2136192Z 2025-05-07T20:25:25.2136197Z 2025-05-07T20:25:25.2136202Z 2025-05-07T20:25:25.2136207Z 2025-05-07T20:25:25.2136213Z 2025-05-07T20:25:25.2136218Z 2025-05-07T20:25:25.2136485Z  2025-05-07T20:25:25.2136825Z 2025-05-07T20:25:25.2136832Z 2025-05-07T20:25:25.2136838Z 2025-05-07T20:25:25.2136844Z 2025-05-07T20:25:25.2136849Z 2025-05-07T20:25:25.2136854Z 2025-05-07T20:25:25.2137080Z 2025-05-07T20:25:25.2137085Z 2025-05-07T20:25:25.2137394Z  2025-05-07T20:25:25.2137753Z 2025-05-07T20:25:25.2137771Z 2025-05-07T20:25:25.2137777Z 2025-05-07T20:25:25.2137782Z 2025-05-07T20:25:25.2137788Z 2025-05-07T20:25:25.2137793Z 2025-05-07T20:25:25.2137799Z 2025-05-07T20:25:25.2137805Z 2025-05-07T20:25:25.2137811Z 2025-05-07T20:25:25.2138142Z  2025-05-07T20:25:25.2138519Z 2025-05-07T20:25:25.2138524Z 2025-05-07T20:25:25.2138529Z 2025-05-07T20:25:25.2138534Z 2025-05-07T20:25:25.2138540Z 2025-05-07T20:25:25.2138545Z 2025-05-07T20:25:25.2138550Z 2025-05-07T20:25:25.2138555Z 2025-05-07T20:25:25.2138561Z 2025-05-07T20:25:25.2138566Z 2025-05-07T20:25:25.2138870Z  2025-05-07T20:25:25.2139282Z 2025-05-07T20:25:25.2139290Z 2025-05-07T20:25:25.2139307Z 2025-05-07T20:25:25.2139315Z 2025-05-07T20:25:25.2139323Z 2025-05-07T20:25:25.2139330Z 2025-05-07T20:25:25.2139339Z 2025-05-07T20:25:25.2139346Z 2025-05-07T20:25:25.2139362Z 2025-05-07T20:25:25.2139377Z 2025-05-07T20:25:25.2139384Z 2025-05-07T20:25:25.2139734Z  done 2025-05-07T20:25:25.3142128Z Preparing transaction: \ done 2025-05-07T20:25:25.6150676Z Verifying transaction: / - \ done 2025-05-07T20:25:25.7160774Z Executing transaction: / done 2025-05-07T20:25:25.8803379Z [INSTALL] Setting the C/C++ compiler symlinks ... 2025-05-07T20:25:29.7779111Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:29.7779683Z 2025-05-07T20:25:29.7790494Z 2025-05-07T20:25:29.7809436Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:29.7810016Z 2025-05-07T20:25:29.7822339Z 2025-05-07T20:25:29.7839537Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:29.7840092Z 2025-05-07T20:25:29.7851312Z 2025-05-07T20:25:29.7868404Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:29.7868958Z 2025-05-07T20:25:29.7880979Z 2025-05-07T20:25:31.6705477Z /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:31.6705777Z 2025-05-07T20:25:31.7322669Z [CHECK] Binary cc found in PATH 2025-05-07T20:25:33.6176799Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:33.6177093Z 2025-05-07T20:25:33.6800758Z [CHECK] Binary gcc found in PATH 2025-05-07T20:25:35.5582714Z /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:35.5583069Z 2025-05-07T20:25:35.6204242Z [CHECK] Binary c++ found in PATH 2025-05-07T20:25:37.4966262Z /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:37.4966562Z 2025-05-07T20:25:37.5597541Z [CHECK] Binary g++ found in PATH 2025-05-07T20:25:37.5601846Z [INFO] Printing out all preprocessor defines in the C compiler ... 
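The four `ln -sf` commands above, together with the PATH checks that follow them, implement a small symlink-and-verify pattern: point the generic compiler names at the conda toolchain wrappers, then confirm each name resolves. A minimal sketch of the same step as a standalone script (paths are the ones printed in the log; the loop and failure handling are illustrative, not taken from setup_env.bash):

    #!/usr/bin/env bash
    # Sketch only: recreate the generic compiler names inside the conda env
    # and verify that each one resolves in PATH afterwards. Assumes the env's
    # bin directory is already on PATH, as it is in this job.
    set -euo pipefail
    prefix="/home/ec2-user/miniconda/envs/build_binary/bin"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-cc"  "${prefix}/cc"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-cc"  "${prefix}/gcc"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-c++" "${prefix}/c++"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-c++" "${prefix}/g++"
    for tool in cc gcc c++ g++; do
      # command -v prints the resolved path, mirroring the [CHECK] lines above
      command -v "${tool}" || { echo "[CHECK] Binary ${tool} not found in PATH"; exit 1; }
    done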
2025-05-07T20:25:37.5601846Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:37.5602314Z + conda run -n build_binary cc -dM -E -
2025-05-07T20:25:39.4558121Z #define __GNUC__ 11
2025-05-07T20:25:39.4582266Z #define __LP64__ 1
2025-05-07T20:25:39.4588471Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:39.4600918Z #define __x86_64__ 1
2025-05-07T20:25:39.4610315Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:39.4621637Z #define __linux__ 1
2025-05-07T20:25:39.4631969Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:39.4644510Z #define __GNUC_MINOR__ 4
2025-05-07T20:25:39.4647150Z #define __STDC__ 1
[... several hundred additional predefined macros elided ...]
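A full `-dM -E` dump like the one above runs to several hundred macros, so for spot checks it is usually grepped down to the identifying ones. A hedged one-liner in the same style as the probe commands this job runs (env name taken from the log):

    # Illustrative spot check: keep only the toolchain-identifying macros.
    conda run -n build_binary cc -dM -E - < /dev/null \
      | grep -E '__GNUC__|__GNUC_MINOR__|__GNUC_PATCHLEVEL__|__VERSION__|__STDC_VERSION__'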
2025-05-07T20:25:39.5153554Z [INFO] Printing out all preprocessor defines in the C++ compiler ...
2025-05-07T20:25:39.5154124Z + conda run -n build_binary c++ -dM -E -x c++ -
2025-05-07T20:25:41.4083859Z #define __GNUC__ 11
2025-05-07T20:25:41.4100232Z #define __cplusplus 201703L
2025-05-07T20:25:41.4104677Z #define __GNUG__ 11
2025-05-07T20:25:41.4118975Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:41.4132534Z #define __x86_64__ 1
2025-05-07T20:25:41.4155190Z #define __linux__ 1
2025-05-07T20:25:41.4165271Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:41.4180603Z #define __GNUC_MINOR__ 4
2025-05-07T20:25:41.4183896Z #define __STDC__ 1
[... several hundred additional predefined macros, including the __cpp_* feature-test macros, elided ...]
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:43.3503370Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:43.3503840Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:43.3504399Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:43.3504773Z 2025-05-07T20:25:43.3504777Z 2025-05-07T20:25:43.4118070Z 2025-05-07T20:25:43.4118426Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:43.4119475Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:43.4119810Z 2025-05-07T20:25:45.3644334Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:45.3647321Z 2025-05-07T20:25:45.3647684Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:45.3648394Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:45.3648717Z 2025-05-07T20:25:47.3128849Z #define __cplusplus 201703L 2025-05-07T20:25:47.3130978Z 2025-05-07T20:25:47.3131578Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:47.3176995Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:47.3177462Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:47.3189724Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:47.3190082Z env: 2025-05-07T20:25:47.3190307Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:47.3190620Z BUILD_ENV: build_binary 2025-05-07T20:25:47.3190882Z BUILD_TARGET: genai 2025-05-07T20:25:47.3191116Z BUILD_VARIANT: cuda 2025-05-07T20:25:47.3191358Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:25:47.3191622Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:47.3191930Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:47.3192280Z ##[endgroup] 2025-05-07T20:25:47.6500865Z ################################################################################ 2025-05-07T20:25:47.6501381Z # Install CUDA 2025-05-07T20:25:47.6501679Z # 2025-05-07T20:25:47.6517678Z # [2025-05-07T20:25:47.651Z] + install_cuda build_binary 12.8.0 2025-05-07T20:25:47.6518237Z ################################################################################ 2025-05-07T20:25:47.6518559Z 2025-05-07T20:25:47.6534587Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:47.7421187Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:47.7421579Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:47.7426588Z + conda clean --packages --tarball -y 2025-05-07T20:25:47.7426798Z 2025-05-07T20:25:48.4479259Z Will remove 29 (113.6 MB) tarball(s). 2025-05-07T20:25:48.4479815Z Will remove 6 (619 KB) package(s). 2025-05-07T20:25:48.5133660Z 2025-05-07T20:25:48.5142184Z + conda clean --all -y 2025-05-07T20:25:48.5142397Z 2025-05-07T20:25:49.1841527Z There are no unused tarball(s) to remove. 2025-05-07T20:25:49.1842038Z Will remove 1 index cache(s). 2025-05-07T20:25:49.1842438Z There are no unused package(s) to remove. 2025-05-07T20:25:49.1842851Z There are no tempfile(s) to remove. 2025-05-07T20:25:49.1843261Z There are no logfile(s) to remove. 2025-05-07T20:25:49.2460535Z 2025-05-07T20:25:49.2473475Z [INSTALL] Installing CUDA 12.8.0 ... 
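For reference, the compiler probes above and the CUDA install below can be reproduced outside this workflow with plain conda commands. A minimal sketch, assuming miniconda is on PATH and an env named build_binary already exists (only the env name and the probe/install commands come from this job's log; the nvcc sanity check is an illustrative assumption, not part of the workflow script):

    # Probe the default C / C++ language standards (same commands the job runs):
    conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__     # expect 201710L (C17)
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus  # expect 201703L (C++17)
    # Install the CUDA 12.8 toolkit from conda-forge (mirrors the EXEC line below):
    conda install -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
    # Hypothetical sanity check (not in this log): confirm nvcc is visible in the env
    conda run -n build_binary nvcc --version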
2025-05-07T20:25:49.2497618Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:25:50.1538831Z Channels: 2025-05-07T20:25:50.1539131Z - conda-forge 2025-05-07T20:25:50.1539475Z Platform: linux-64 2025-05-07T20:26:00.7925002Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:26:01.9179729Z Solving environment: - \ | / done 2025-05-07T20:26:01.9928136Z 2025-05-07T20:26:01.9928730Z ## Package Plan ## 2025-05-07T20:26:01.9928972Z 2025-05-07T20:26:01.9929272Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:01.9929705Z 2025-05-07T20:26:01.9929842Z added / updated specs: 2025-05-07T20:26:01.9930103Z - cuda=12.8.0 2025-05-07T20:26:01.9930240Z 2025-05-07T20:26:01.9930280Z 2025-05-07T20:26:01.9930404Z The following packages will be downloaded: 2025-05-07T20:26:01.9930626Z 2025-05-07T20:26:01.9930750Z package | build 2025-05-07T20:26:01.9931083Z ---------------------------|----------------- 2025-05-07T20:26:01.9931470Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:01.9931914Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:01.9932405Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:01.9932840Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:01.9933262Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:01.9933704Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:26:01.9934610Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9935131Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:01.9935679Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:26:01.9936365Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:26:01.9936830Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:01.9937301Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:26:01.9937809Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:26:01.9938320Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:01.9938849Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:26:01.9939375Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:26:01.9939873Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:26:01.9940333Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:26:01.9940803Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:26:01.9941276Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:26:01.9941747Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:01.9942251Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:26:01.9942732Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:26:01.9943187Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9943669Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9944155Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:01.9944600Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:01.9945075Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:26:01.9945564Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:26:01.9946039Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 
MB conda-forge 2025-05-07T20:26:01.9946521Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:01.9946994Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:26:01.9947464Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:26:01.9947925Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:26:01.9948390Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:26:01.9948849Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:26:01.9949308Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:26:01.9949764Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:01.9950231Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:26:01.9950723Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:26:01.9951197Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:26:01.9951660Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:26:01.9952107Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:01.9952578Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:26:01.9953178Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:01.9953657Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:01.9954139Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:26:01.9954699Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:01.9955145Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:01.9955584Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:26:01.9956055Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9956532Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:01.9956956Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:01.9957350Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:26:01.9957834Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:01.9958367Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:01.9958895Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:01.9959412Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:01.9959871Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:01.9960457Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:01.9960938Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:01.9961390Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:01.9961831Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:01.9962264Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:26:01.9962686Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:01.9963072Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:01.9963486Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:01.9963895Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:01.9964298Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:01.9964730Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:26:01.9965194Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:26:01.9965647Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:26:01.9966103Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:01.9966566Z libcufile-1.13.0.11 | h12f29b5_0 939 KB 
conda-forge 2025-05-07T20:26:01.9967024Z libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge 2025-05-07T20:26:01.9967487Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:26:01.9967956Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:26:01.9968429Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:26:01.9968904Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:26:01.9969386Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:26:01.9969868Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:01.9970358Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:26:01.9970835Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:26:01.9971376Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:01.9971845Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:01.9972393Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:01.9972844Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:01.9973268Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:26:01.9973714Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:01.9974156Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:01.9974568Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:01.9974987Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:26:01.9975436Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:26:01.9975878Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:01.9976320Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:26:01.9976804Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:01.9977285Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:26:01.9977770Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:01.9978240Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:26:01.9978707Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:01.9979166Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:26:01.9979589Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:01.9980022Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:01.9980469Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:01.9980921Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:01.9981350Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:26:01.9981817Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:01.9982261Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:01.9982711Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:01.9983144Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:01.9983569Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:01.9983985Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:01.9984386Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:26:01.9984854Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:26:01.9985314Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:01.9985705Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:01.9986113Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:01.9986575Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:01.9987030Z pcre2-10.44 | 
hc749103_2 934 KB conda-forge 2025-05-07T20:26:01.9987472Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:01.9987930Z python-3.13.0 |h9ebbce0_101_cp313 31.5 MB conda-forge 2025-05-07T20:26:01.9988468Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:01.9988893Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:26:01.9989404Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:01.9989819Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:01.9990241Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:01.9990688Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:01.9991164Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:01.9991637Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:01.9992138Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:01.9992607Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:01.9993080Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:01.9993552Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:01.9993996Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:01.9994443Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:01.9994895Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:01.9995377Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:01.9995874Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:01.9996350Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:01.9996812Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:01.9997276Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:01.9997733Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:01.9998189Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:01.9998673Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:01.9999138Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:01.9999567Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:01.9999962Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:02.0000460Z ------------------------------------------------------------ 2025-05-07T20:26:02.0000806Z Total: 1.91 GB 2025-05-07T20:26:02.0001028Z 2025-05-07T20:26:02.0001160Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:02.0001390Z 2025-05-07T20:26:02.0001618Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:02.0002049Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:02.0002490Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:02.0002966Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:02.0003410Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:26:02.0003888Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:26:02.0004498Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:26:02.0005093Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:26:02.0005653Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:02.0006222Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:26:02.0006846Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 
2025-05-07T20:26:02.0007386Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:26:02.0008346Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:02.0009188Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:26:02.0010145Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:02.0011112Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:02.0012045Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0012873Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0 2025-05-07T20:26:02.0013973Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0 2025-05-07T20:26:02.0014840Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0015670Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1 2025-05-07T20:26:02.0016585Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1 2025-05-07T20:26:02.0017424Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0 2025-05-07T20:26:02.0018214Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0 2025-05-07T20:26:02.0019095Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0 2025-05-07T20:26:02.0019975Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0 2025-05-07T20:26:02.0020729Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0 2025-05-07T20:26:02.0021536Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1 2025-05-07T20:26:02.0022353Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1 2025-05-07T20:26:02.0022948Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1 2025-05-07T20:26:02.0023534Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0 2025-05-07T20:26:02.0024115Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0024657Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0025195Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0 2025-05-07T20:26:02.0025725Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0026251Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0 2025-05-07T20:26:02.0026776Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0 2025-05-07T20:26:02.0027299Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0027851Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:02.0028440Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1 2025-05-07T20:26:02.0029009Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1 2025-05-07T20:26:02.0029546Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0 2025-05-07T20:26:02.0030051Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0030604Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0 2025-05-07T20:26:02.0031196Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0 2025-05-07T20:26:02.0031766Z cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0 2025-05-07T20:26:02.0032339Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0032916Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0 2025-05-07T20:26:02.0033660Z cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0 2025-05-07T20:26:02.0034169Z cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3 2025-05-07T20:26:02.0034848Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0 2025-05-07T20:26:02.0035410Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:26:02.0035886Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:26:02.0036422Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:26:02.0037055Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:26:02.0037676Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:26:02.0038276Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:26:02.0038800Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:26:02.0039330Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:26:02.0039847Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:26:02.0040473Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:26:02.0040925Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:26:02.0041373Z gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0 2025-05-07T20:26:02.0041819Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:26:02.0042222Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:26:02.0042657Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:26:02.0043099Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:26:02.0043520Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:26:02.0044002Z libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:02.0044540Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:02.0045064Z libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0 2025-05-07T20:26:02.0045590Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0 2025-05-07T20:26:02.0046114Z libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0 2025-05-07T20:26:02.0046645Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0 2025-05-07T20:26:02.0047172Z libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0 2025-05-07T20:26:02.0047703Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0 2025-05-07T20:26:02.0048251Z libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:02.0048818Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:02.0049388Z libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0 2025-05-07T20:26:02.0049958Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0 2025-05-07T20:26:02.0051039Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:26:02.0051581Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:26:02.0052139Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:26:02.0052668Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:26:02.0053210Z 
libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:26:02.0053717Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:26:02.0054177Z libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2 2025-05-07T20:26:02.0054676Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:26:02.0055275Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:26:02.0055729Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:26:02.0056262Z libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0 2025-05-07T20:26:02.0056753Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0 2025-05-07T20:26:02.0057243Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:26:02.0057925Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0058480Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0 2025-05-07T20:26:02.0059180Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0 2025-05-07T20:26:02.0059747Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0 2025-05-07T20:26:02.0060297Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0 2025-05-07T20:26:02.0060826Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0 2025-05-07T20:26:02.0061346Z libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2 2025-05-07T20:26:02.0061850Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:26:02.0062326Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:26:02.0062816Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:26:02.0063304Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:26:02.0063753Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:26:02.0064235Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:26:02.0064745Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:26:02.0065216Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:26:02.0065664Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:02.0066097Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:26:02.0066608Z nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0 2025-05-07T20:26:02.0067120Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:26:02.0067513Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:26:02.0067932Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:26:02.0068449Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:26:02.0068968Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:26:02.0069448Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:26:02.0069959Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:26:02.0070425Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:26:02.0071122Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:26:02.0071761Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:26:02.0072572Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:26:02.0073236Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:26:02.0073840Z xcb-util-renderut~ 
conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:26:02.0074392Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:26:02.0074921Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:26:02.0084632Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:26:02.0085227Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:26:02.0085914Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:26:02.0086421Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:26:02.0087074Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:26:02.0087943Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:26:02.0088647Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:26:02.0089181Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:26:02.0089724Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:26:02.0090251Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:26:02.0090815Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:26:02.0091387Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:26:02.0091997Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:26:02.0092680Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:26:02.0093043Z 2025-05-07T20:26:02.0093183Z The following packages will be UPDATED: 2025-05-07T20:26:02.0093401Z 2025-05-07T20:26:02.0093699Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:26:02.0094345Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T20:26:02.0094966Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0 2025-05-07T20:26:02.0095566Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:02.0095911Z 2025-05-07T20:26:02.0096136Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:26:02.0096462Z 2025-05-07T20:26:02.0096718Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0 2025-05-07T20:26:02.0097360Z python pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313 2025-05-07T20:26:02.0098000Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:26:02.0098337Z 2025-05-07T20:26:02.0098363Z 2025-05-07T20:26:02.0098367Z 2025-05-07T20:26:02.0098517Z Downloading and Extracting Packages: ...working... 2025-05-07T20:26:02.0099005Z libcublas-12.8.3.14 | 460.2 MB | | 0% 2025-05-07T20:26:02.0099348Z 2025-05-07T20:26:02.0099778Z nsight-compute-2025. 
| 320.6 MB | | 0%
2025-05-07T20:26:02.1025561Z [... carriage-return progress-bar updates elided; libcublas (460.2 MB), nsight-compute (320.6 MB), libcusparse (164.9 MB), libcusolver (156.9 MB), libcufft (147.4 MB), libnpp (130.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (112.4 MB) and the remaining packages download in parallel ...]
| 320.6 MB | ###9 | 39%  2025-05-07T20:26:05.6683411Z 2025-05-07T20:26:05.6683416Z 2025-05-07T20:26:05.6683422Z 2025-05-07T20:26:05.6687035Z 2025-05-07T20:26:05.6803955Z libcufft-11.3.3.41 | 147.4 MB | #######9 | 79%  2025-05-07T20:26:05.6944140Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 28% 2025-05-07T20:26:05.6944396Z 2025-05-07T20:26:05.6944634Z 2025-05-07T20:26:05.7409016Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 77%  2025-05-07T20:26:05.7409480Z 2025-05-07T20:26:05.7409486Z 2025-05-07T20:26:05.7409787Z 2025-05-07T20:26:05.7660479Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 77%  2025-05-07T20:26:05.7662704Z 2025-05-07T20:26:05.7686751Z nsight-compute-2025. | 320.6 MB | #### | 40%  2025-05-07T20:26:05.7687102Z 2025-05-07T20:26:05.7687109Z 2025-05-07T20:26:05.7687114Z 2025-05-07T20:26:05.7688499Z 2025-05-07T20:26:05.7805967Z libcufft-11.3.3.41 | 147.4 MB | ########1 | 82%  2025-05-07T20:26:05.7944949Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:26:05.7945292Z 2025-05-07T20:26:05.7947862Z 2025-05-07T20:26:05.8411336Z libcusparse-12.5.7.5 | 164.9 MB | #######8 | 79%  2025-05-07T20:26:05.8411728Z 2025-05-07T20:26:05.8411733Z 2025-05-07T20:26:05.8412272Z 2025-05-07T20:26:05.8664842Z libcusolver-11.7.2.5 | 156.9 MB | ######## | 80%  2025-05-07T20:26:05.8668635Z 2025-05-07T20:26:05.8692530Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:26:05.8692866Z 2025-05-07T20:26:05.8692872Z 2025-05-07T20:26:05.8692876Z 2025-05-07T20:26:05.8694039Z 2025-05-07T20:26:05.8835594Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 84%  2025-05-07T20:26:05.8947565Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 30% 2025-05-07T20:26:05.8947926Z 2025-05-07T20:26:05.8947931Z 2025-05-07T20:26:05.9416567Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:26:05.9416932Z 2025-05-07T20:26:05.9416936Z 2025-05-07T20:26:05.9416954Z 2025-05-07T20:26:05.9695257Z libcusolver-11.7.2.5 | 156.9 MB | ########2 | 82%  2025-05-07T20:26:05.9696727Z 2025-05-07T20:26:05.9734706Z nsight-compute-2025. | 320.6 MB | ####2 | 43%  2025-05-07T20:26:05.9734993Z 2025-05-07T20:26:05.9734997Z 2025-05-07T20:26:05.9735000Z 2025-05-07T20:26:05.9735004Z 2025-05-07T20:26:05.9838398Z libcufft-11.3.3.41 | 147.4 MB | ########5 | 86%  2025-05-07T20:26:05.9950294Z libcublas-12.8.3.14 | 460.2 MB | ### | 31% 2025-05-07T20:26:05.9950734Z 2025-05-07T20:26:05.9952276Z 2025-05-07T20:26:06.0420639Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 84%  2025-05-07T20:26:06.0421019Z 2025-05-07T20:26:06.0421023Z 2025-05-07T20:26:06.0423193Z 2025-05-07T20:26:06.0725784Z libcusolver-11.7.2.5 | 156.9 MB | ########4 | 85%  2025-05-07T20:26:06.0727499Z 2025-05-07T20:26:06.0770380Z nsight-compute-2025. | 320.6 MB | ####3 | 44%  2025-05-07T20:26:06.0770753Z 2025-05-07T20:26:06.0770757Z 2025-05-07T20:26:06.0770761Z 2025-05-07T20:26:06.0771402Z 2025-05-07T20:26:06.0843978Z libcufft-11.3.3.41 | 147.4 MB | ########8 | 88%  2025-05-07T20:26:06.0954209Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:26:06.0954479Z 2025-05-07T20:26:06.0954483Z 2025-05-07T20:26:06.1429532Z libcusparse-12.5.7.5 | 164.9 MB | ########5 | 86%  2025-05-07T20:26:06.1429941Z 2025-05-07T20:26:06.1429947Z 2025-05-07T20:26:06.1434537Z 2025-05-07T20:26:06.1747642Z libcusolver-11.7.2.5 | 156.9 MB | ########7 | 87%  2025-05-07T20:26:06.1751112Z 2025-05-07T20:26:06.1773403Z nsight-compute-2025. 
| 320.6 MB | ####4 | 45%  2025-05-07T20:26:06.1773759Z 2025-05-07T20:26:06.1773763Z 2025-05-07T20:26:06.1773767Z 2025-05-07T20:26:06.1774476Z 2025-05-07T20:26:06.1895505Z libcufft-11.3.3.41 | 147.4 MB | ######### | 90%  2025-05-07T20:26:06.1954514Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 32% 2025-05-07T20:26:06.1954900Z 2025-05-07T20:26:06.1956830Z 2025-05-07T20:26:06.2471065Z libcusparse-12.5.7.5 | 164.9 MB | ########8 | 88%  2025-05-07T20:26:06.2471357Z 2025-05-07T20:26:06.2471361Z 2025-05-07T20:26:06.2473861Z 2025-05-07T20:26:06.2754396Z libcusolver-11.7.2.5 | 156.9 MB | ########9 | 89%  2025-05-07T20:26:06.2754735Z 2025-05-07T20:26:06.2779918Z nsight-compute-2025. | 320.6 MB | ####5 | 46%  2025-05-07T20:26:06.2780552Z 2025-05-07T20:26:06.2780556Z 2025-05-07T20:26:06.2780559Z 2025-05-07T20:26:06.2784958Z 2025-05-07T20:26:06.2902510Z libcufft-11.3.3.41 | 147.4 MB | #########2 | 93%  2025-05-07T20:26:06.3000567Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:26:06.3000843Z 2025-05-07T20:26:06.3002043Z 2025-05-07T20:26:06.3473311Z libcusparse-12.5.7.5 | 164.9 MB | ######### | 90%  2025-05-07T20:26:06.3473657Z 2025-05-07T20:26:06.3473663Z 2025-05-07T20:26:06.3475138Z 2025-05-07T20:26:06.3754909Z libcusolver-11.7.2.5 | 156.9 MB | #########1 | 92%  2025-05-07T20:26:06.3756696Z 2025-05-07T20:26:06.3844356Z nsight-compute-2025. | 320.6 MB | ####6 | 47%  2025-05-07T20:26:06.3844740Z 2025-05-07T20:26:06.3844747Z 2025-05-07T20:26:06.3844771Z 2025-05-07T20:26:06.3844774Z 2025-05-07T20:26:06.3925867Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 95%  2025-05-07T20:26:06.4002378Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 34% 2025-05-07T20:26:06.4002697Z 2025-05-07T20:26:06.4005054Z 2025-05-07T20:26:06.4477476Z libcusparse-12.5.7.5 | 164.9 MB | #########2 | 93%  2025-05-07T20:26:06.4477972Z 2025-05-07T20:26:06.4477978Z 2025-05-07T20:26:06.4479354Z 2025-05-07T20:26:06.4763150Z libcusolver-11.7.2.5 | 156.9 MB | #########4 | 94%  2025-05-07T20:26:06.4763506Z 2025-05-07T20:26:06.4846548Z nsight-compute-2025. | 320.6 MB | ####8 | 48%  2025-05-07T20:26:06.4846894Z 2025-05-07T20:26:06.4846898Z 2025-05-07T20:26:06.4846902Z 2025-05-07T20:26:06.4846905Z 2025-05-07T20:26:06.4965785Z libcufft-11.3.3.41 | 147.4 MB | #########7 | 97%  2025-05-07T20:26:06.5005397Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:26:06.5005657Z 2025-05-07T20:26:06.5008939Z 2025-05-07T20:26:06.5515446Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 95%  2025-05-07T20:26:06.5515746Z 2025-05-07T20:26:06.5515749Z 2025-05-07T20:26:06.5518314Z 2025-05-07T20:26:06.5764928Z libcusolver-11.7.2.5 | 156.9 MB | #########6 | 97%  2025-05-07T20:26:06.5766037Z 2025-05-07T20:26:06.5847017Z nsight-compute-2025. | 320.6 MB | ####9 | 49%  2025-05-07T20:26:06.5847302Z 2025-05-07T20:26:06.5847306Z 2025-05-07T20:26:06.5847311Z 2025-05-07T20:26:06.5847482Z 2025-05-07T20:26:06.5976077Z libcufft-11.3.3.41 | 147.4 MB | #########9 | 99%  2025-05-07T20:26:06.6059300Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 35% 2025-05-07T20:26:06.6059660Z 2025-05-07T20:26:06.6060869Z 2025-05-07T20:26:06.6515532Z libcusparse-12.5.7.5 | 164.9 MB | #########7 | 97%  2025-05-07T20:26:06.6515868Z 2025-05-07T20:26:06.6515872Z 2025-05-07T20:26:06.6518600Z 2025-05-07T20:26:06.6766813Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 99%  2025-05-07T20:26:06.6767615Z 2025-05-07T20:26:06.6980907Z nsight-compute-2025. 
| 320.6 MB | ##### | 50%  2025-05-07T20:26:06.7062864Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 36% 2025-05-07T20:26:06.7063147Z 2025-05-07T20:26:06.7063780Z 2025-05-07T20:26:06.7767135Z libcusparse-12.5.7.5 | 164.9 MB | #########9 | 100%  2025-05-07T20:26:06.7768133Z 2025-05-07T20:26:06.7981836Z nsight-compute-2025. | 320.6 MB | #####1 | 52%  2025-05-07T20:26:06.8768854Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:26:06.8771717Z 2025-05-07T20:26:06.8984483Z nsight-compute-2025. | 320.6 MB | #####3 | 53%  2025-05-07T20:26:06.9770004Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 38% 2025-05-07T20:26:06.9771802Z 2025-05-07T20:26:06.9986517Z nsight-compute-2025. | 320.6 MB | #####4 | 55%  2025-05-07T20:26:07.0770071Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 39% 2025-05-07T20:26:07.0770579Z 2025-05-07T20:26:07.0987180Z nsight-compute-2025. | 320.6 MB | #####6 | 57%  2025-05-07T20:26:07.1773513Z libcublas-12.8.3.14 | 460.2 MB | #### | 40% 2025-05-07T20:26:07.1773879Z 2025-05-07T20:26:07.1991196Z nsight-compute-2025. | 320.6 MB | #####8 | 58%  2025-05-07T20:26:07.2774507Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:26:07.2777457Z 2025-05-07T20:26:07.2991507Z nsight-compute-2025. | 320.6 MB | #####9 | 60%  2025-05-07T20:26:07.3776156Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 43% 2025-05-07T20:26:07.3776573Z 2025-05-07T20:26:07.4137045Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:26:07.4776604Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 44% 2025-05-07T20:26:07.4776867Z 2025-05-07T20:26:07.5542509Z nsight-compute-2025. | 320.6 MB | ######3 | 63%  2025-05-07T20:26:07.5778326Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 45% 2025-05-07T20:26:07.5779287Z 2025-05-07T20:26:07.6543040Z nsight-compute-2025. | 320.6 MB | ######5 | 65%  2025-05-07T20:26:07.6960904Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:26:07.6961285Z 2025-05-07T20:26:07.7545608Z nsight-compute-2025. | 320.6 MB | ######7 | 67%  2025-05-07T20:26:07.8157429Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 47% 2025-05-07T20:26:07.8159556Z 2025-05-07T20:26:07.8546847Z nsight-compute-2025. | 320.6 MB | ######8 | 69%  2025-05-07T20:26:07.9175165Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:26:07.9175863Z 2025-05-07T20:26:07.9550156Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:26:08.0268751Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 49% 2025-05-07T20:26:08.0269519Z 2025-05-07T20:26:08.0551347Z nsight-compute-2025. | 320.6 MB | #######2 | 72%  2025-05-07T20:26:08.1426335Z libcublas-12.8.3.14 | 460.2 MB | ##### | 50% 2025-05-07T20:26:08.1426674Z 2025-05-07T20:26:08.1551819Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:26:08.2426964Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:26:08.2427301Z 2025-05-07T20:26:08.2556231Z nsight-compute-2025. | 320.6 MB | #######5 | 75%  2025-05-07T20:26:08.3431453Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 53% 2025-05-07T20:26:08.3431830Z 2025-05-07T20:26:08.3562912Z nsight-compute-2025. | 320.6 MB | #######6 | 77%  2025-05-07T20:26:08.4488248Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 54% 2025-05-07T20:26:08.4488966Z 2025-05-07T20:26:08.4598715Z nsight-compute-2025. | 320.6 MB | #######8 | 78%  2025-05-07T20:26:08.5488838Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 55% 2025-05-07T20:26:08.5489222Z 2025-05-07T20:26:08.5598696Z nsight-compute-2025. 
| 320.6 MB | #######9 | 80%  2025-05-07T20:26:08.6521324Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 56% 2025-05-07T20:26:08.6524457Z 2025-05-07T20:26:08.6607269Z nsight-compute-2025. | 320.6 MB | ########1 | 81%  2025-05-07T20:26:08.7522744Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 57% 2025-05-07T20:26:08.7523280Z 2025-05-07T20:26:08.7616975Z nsight-compute-2025. | 320.6 MB | ########2 | 83%  2025-05-07T20:26:08.8574129Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 58% 2025-05-07T20:26:08.8575004Z 2025-05-07T20:26:08.8628650Z nsight-compute-2025. | 320.6 MB | ########4 | 84%  2025-05-07T20:26:08.9629390Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 59% 2025-05-07T20:26:08.9630085Z 2025-05-07T20:26:08.9634078Z nsight-compute-2025. | 320.6 MB | ########5 | 86%  2025-05-07T20:26:09.0629713Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:26:09.0630079Z 2025-05-07T20:26:09.0659450Z nsight-compute-2025. | 320.6 MB | ########7 | 87%  2025-05-07T20:26:09.1632722Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 62% 2025-05-07T20:26:09.1633388Z 2025-05-07T20:26:09.1661483Z nsight-compute-2025. | 320.6 MB | ########8 | 89%  2025-05-07T20:26:09.2665833Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:26:09.2797909Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:26:09.2798848Z 2025-05-07T20:26:09.3665798Z nsight-compute-2025. | 320.6 MB | ######### | 90%  2025-05-07T20:26:09.3988779Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:26:09.3989204Z 2025-05-07T20:26:09.4725488Z nsight-compute-2025. | 320.6 MB | #########1 | 92%  2025-05-07T20:26:09.4989770Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:26:09.4992140Z 2025-05-07T20:26:09.5238784Z nsight-compute-2025. | 320.6 MB | #########3 | 93%  2025-05-07T20:26:09.5239127Z 2025-05-07T20:26:09.5239133Z 2025-05-07T20:26:09.5239138Z 2025-05-07T20:26:09.5239143Z 2025-05-07T20:26:09.5867655Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:09.5867952Z 2025-05-07T20:26:09.5867956Z 2025-05-07T20:26:09.5867959Z 2025-05-07T20:26:09.5867963Z 2025-05-07T20:26:09.5871571Z 2025-05-07T20:26:09.6016039Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:26:09.6188925Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:26:09.6190584Z 2025-05-07T20:26:09.6868764Z nsight-compute-2025. | 320.6 MB | #########4 | 94%  2025-05-07T20:26:09.6869050Z 2025-05-07T20:26:09.6869055Z 2025-05-07T20:26:09.6869058Z 2025-05-07T20:26:09.6869062Z 2025-05-07T20:26:09.6871292Z 2025-05-07T20:26:09.7404565Z libnpp-12.3.3.65 | 130.6 MB | 2 | 3%  2025-05-07T20:26:09.7406297Z 2025-05-07T20:26:09.7433377Z nsight-compute-2025. | 320.6 MB | #########5 | 96%  2025-05-07T20:26:09.7870704Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:26:09.7871030Z 2025-05-07T20:26:09.7871034Z 2025-05-07T20:26:09.7871038Z 2025-05-07T20:26:09.7871042Z 2025-05-07T20:26:09.7872502Z 2025-05-07T20:26:09.8411648Z libnpp-12.3.3.65 | 130.6 MB | 5 | 6%  2025-05-07T20:26:09.8411940Z 2025-05-07T20:26:09.8411944Z 2025-05-07T20:26:09.8421144Z 2025-05-07T20:26:09.8571849Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:09.8572146Z 2025-05-07T20:26:09.8573518Z 2025-05-07T20:26:09.8776935Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:09.8779267Z 2025-05-07T20:26:09.8794048Z nsight-compute-2025. 
| 320.6 MB | #########7 | 97%  2025-05-07T20:26:09.8875612Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 70% 2025-05-07T20:26:09.8876025Z 2025-05-07T20:26:09.8876032Z 2025-05-07T20:26:09.8876037Z 2025-05-07T20:26:09.8876042Z 2025-05-07T20:26:09.8878164Z 2025-05-07T20:26:09.8936290Z libnpp-12.3.3.65 | 130.6 MB | 8 | 8%  2025-05-07T20:26:09.8936593Z 2025-05-07T20:26:09.8936597Z 2025-05-07T20:26:09.8936600Z 2025-05-07T20:26:09.8936604Z 2025-05-07T20:26:09.8936608Z 2025-05-07T20:26:09.8937896Z 2025-05-07T20:26:09.9041861Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:26:09.9042165Z 2025-05-07T20:26:09.9042181Z 2025-05-07T20:26:09.9042185Z 2025-05-07T20:26:09.9042188Z 2025-05-07T20:26:09.9042192Z 2025-05-07T20:26:09.9042196Z 2025-05-07T20:26:09.9045212Z 2025-05-07T20:26:09.9944993Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:26:09.9945304Z 2025-05-07T20:26:09.9945308Z 2025-05-07T20:26:09.9945312Z 2025-05-07T20:26:09.9945323Z 2025-05-07T20:26:09.9945327Z 2025-05-07T20:26:09.9955790Z 2025-05-07T20:26:10.0050320Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 2%  2025-05-07T20:26:10.0050636Z 2025-05-07T20:26:10.0050649Z 2025-05-07T20:26:10.0050653Z 2025-05-07T20:26:10.0050657Z 2025-05-07T20:26:10.0050660Z 2025-05-07T20:26:10.0050664Z 2025-05-07T20:26:10.0052257Z 2025-05-07T20:26:10.0169599Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 2%  2025-05-07T20:26:10.0169908Z 2025-05-07T20:26:10.0169913Z 2025-05-07T20:26:10.0169916Z 2025-05-07T20:26:10.0169920Z 2025-05-07T20:26:10.0169925Z 2025-05-07T20:26:10.0238980Z libnpp-12.3.3.65 | 130.6 MB | # | 11%  2025-05-07T20:26:10.0239334Z 2025-05-07T20:26:10.0434901Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:26:10.0951009Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:26:10.0951359Z 2025-05-07T20:26:10.0951363Z 2025-05-07T20:26:10.0951367Z 2025-05-07T20:26:10.0951371Z 2025-05-07T20:26:10.0951375Z 2025-05-07T20:26:10.0952616Z 2025-05-07T20:26:10.1058872Z cuda-nsight-12.8.55 | 113.2 MB | 4 | 4%  2025-05-07T20:26:10.1059210Z 2025-05-07T20:26:10.1059214Z 2025-05-07T20:26:10.1059218Z 2025-05-07T20:26:10.1059221Z 2025-05-07T20:26:10.1059225Z 2025-05-07T20:26:10.1059228Z 2025-05-07T20:26:10.1059232Z 2025-05-07T20:26:10.1334151Z cuda-nvvp-12.8.57 | 112.4 MB | 3 | 4%  2025-05-07T20:26:10.1334526Z 2025-05-07T20:26:10.1334529Z 2025-05-07T20:26:10.1334533Z 2025-05-07T20:26:10.1334537Z 2025-05-07T20:26:10.1334564Z 2025-05-07T20:26:10.1659782Z libnpp-12.3.3.65 | 130.6 MB | #2 | 13%  2025-05-07T20:26:10.1663675Z 2025-05-07T20:26:10.1892876Z nsight-compute-2025. 
| 320.6 MB | #########9 | 99%  2025-05-07T20:26:10.1955920Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 71% 2025-05-07T20:26:10.1956281Z 2025-05-07T20:26:10.1956287Z 2025-05-07T20:26:10.1956293Z 2025-05-07T20:26:10.1956298Z 2025-05-07T20:26:10.1956303Z 2025-05-07T20:26:10.1958678Z 2025-05-07T20:26:10.2059157Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 6%  2025-05-07T20:26:10.2059632Z 2025-05-07T20:26:10.2059636Z 2025-05-07T20:26:10.2059640Z 2025-05-07T20:26:10.2059644Z 2025-05-07T20:26:10.2059647Z 2025-05-07T20:26:10.2059651Z 2025-05-07T20:26:10.2063347Z 2025-05-07T20:26:10.2589243Z cuda-nvvp-12.8.57 | 112.4 MB | 6 | 6%  2025-05-07T20:26:10.2589661Z 2025-05-07T20:26:10.2589668Z 2025-05-07T20:26:10.2589673Z 2025-05-07T20:26:10.2589679Z 2025-05-07T20:26:10.2589704Z 2025-05-07T20:26:10.2965664Z libnpp-12.3.3.65 | 130.6 MB | #4 | 15%  2025-05-07T20:26:10.2966053Z 2025-05-07T20:26:10.2966059Z 2025-05-07T20:26:10.2966081Z 2025-05-07T20:26:10.2966086Z 2025-05-07T20:26:10.2966091Z 2025-05-07T20:26:10.2968072Z 2025-05-07T20:26:10.3065638Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 8%  2025-05-07T20:26:10.3066063Z 2025-05-07T20:26:10.3066068Z 2025-05-07T20:26:10.3066074Z 2025-05-07T20:26:10.3066079Z 2025-05-07T20:26:10.3066084Z 2025-05-07T20:26:10.3066089Z 2025-05-07T20:26:10.3066094Z 2025-05-07T20:26:10.3308772Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 8%  2025-05-07T20:26:10.3632475Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:26:10.3632847Z 2025-05-07T20:26:10.3632853Z 2025-05-07T20:26:10.3632858Z 2025-05-07T20:26:10.3632863Z 2025-05-07T20:26:10.3634392Z 2025-05-07T20:26:10.3966130Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:26:10.3966430Z 2025-05-07T20:26:10.3966434Z 2025-05-07T20:26:10.3966437Z 2025-05-07T20:26:10.3966441Z 2025-05-07T20:26:10.3966445Z 2025-05-07T20:26:10.3968434Z 2025-05-07T20:26:10.4065961Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:26:10.4066278Z 2025-05-07T20:26:10.4066282Z 2025-05-07T20:26:10.4066285Z 2025-05-07T20:26:10.4066289Z 2025-05-07T20:26:10.4066293Z 2025-05-07T20:26:10.4066296Z 2025-05-07T20:26:10.4068152Z 2025-05-07T20:26:10.4402218Z cuda-nvvp-12.8.57 | 112.4 MB | # | 11%  2025-05-07T20:26:10.4718357Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 2025-05-07T20:26:10.4718630Z 2025-05-07T20:26:10.4718634Z 2025-05-07T20:26:10.4718638Z 2025-05-07T20:26:10.4718641Z 2025-05-07T20:26:10.4720973Z 2025-05-07T20:26:10.4967892Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:26:10.4968189Z 2025-05-07T20:26:10.4968193Z 2025-05-07T20:26:10.4968478Z 2025-05-07T20:26:10.4968483Z 2025-05-07T20:26:10.4968487Z 2025-05-07T20:26:10.4973731Z 2025-05-07T20:26:10.5067975Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:26:10.5068505Z 2025-05-07T20:26:10.5068509Z 2025-05-07T20:26:10.5068512Z 2025-05-07T20:26:10.5068516Z 2025-05-07T20:26:10.5068520Z 2025-05-07T20:26:10.5068523Z 2025-05-07T20:26:10.5069872Z 2025-05-07T20:26:10.5450768Z cuda-nvvp-12.8.57 | 112.4 MB | #3 | 13%  2025-05-07T20:26:10.5723372Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 73% 2025-05-07T20:26:10.5723697Z 2025-05-07T20:26:10.5723701Z 2025-05-07T20:26:10.5723705Z 2025-05-07T20:26:10.5723708Z 2025-05-07T20:26:10.5725897Z 2025-05-07T20:26:10.5977933Z libnpp-12.3.3.65 | 130.6 MB | ## | 21%  2025-05-07T20:26:10.5978229Z 2025-05-07T20:26:10.5978233Z 2025-05-07T20:26:10.5978237Z 2025-05-07T20:26:10.5978241Z 2025-05-07T20:26:10.5978244Z 2025-05-07T20:26:10.5980402Z 2025-05-07T20:26:10.6072237Z cuda-nsight-12.8.55 | 113.2 MB | #4 | 15%  2025-05-07T20:26:10.6072548Z 
2025-05-07T20:26:10.6072552Z 2025-05-07T20:26:10.6072566Z 2025-05-07T20:26:10.6072570Z 2025-05-07T20:26:10.6072573Z 2025-05-07T20:26:10.6072577Z 2025-05-07T20:26:10.6073984Z 2025-05-07T20:26:10.6515209Z cuda-nvvp-12.8.57 | 112.4 MB | #5 | 15%  2025-05-07T20:26:10.6727480Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:10.6727772Z 2025-05-07T20:26:10.6727776Z 2025-05-07T20:26:10.6727780Z 2025-05-07T20:26:10.6727784Z 2025-05-07T20:26:10.6731926Z 2025-05-07T20:26:10.7046718Z libnpp-12.3.3.65 | 130.6 MB | ##2 | 23%  2025-05-07T20:26:10.7047007Z 2025-05-07T20:26:10.7047011Z 2025-05-07T20:26:10.7047015Z 2025-05-07T20:26:10.7047019Z 2025-05-07T20:26:10.7047022Z 2025-05-07T20:26:10.7052373Z 2025-05-07T20:26:10.7128304Z cuda-nsight-12.8.55 | 113.2 MB | #6 | 17%  2025-05-07T20:26:10.7128653Z 2025-05-07T20:26:10.7128659Z 2025-05-07T20:26:10.7128664Z 2025-05-07T20:26:10.7128669Z 2025-05-07T20:26:10.7128674Z 2025-05-07T20:26:10.7128680Z 2025-05-07T20:26:10.7128787Z 2025-05-07T20:26:10.7582100Z cuda-nvvp-12.8.57 | 112.4 MB | #7 | 17%  2025-05-07T20:26:10.7822745Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:26:10.7823013Z 2025-05-07T20:26:10.7823018Z 2025-05-07T20:26:10.7823021Z 2025-05-07T20:26:10.7823025Z 2025-05-07T20:26:10.7823084Z 2025-05-07T20:26:10.8140836Z libnpp-12.3.3.65 | 130.6 MB | ##4 | 25%  2025-05-07T20:26:10.8141231Z 2025-05-07T20:26:10.8141235Z 2025-05-07T20:26:10.8141239Z 2025-05-07T20:26:10.8141242Z 2025-05-07T20:26:10.8141246Z 2025-05-07T20:26:10.8141249Z 2025-05-07T20:26:10.8141253Z 2025-05-07T20:26:10.8201882Z cuda-nvvp-12.8.57 | 112.4 MB | #9 | 20%  2025-05-07T20:26:10.8202180Z 2025-05-07T20:26:10.8202205Z 2025-05-07T20:26:10.8202209Z 2025-05-07T20:26:10.8202213Z 2025-05-07T20:26:10.8202217Z 2025-05-07T20:26:10.8204709Z 2025-05-07T20:26:10.8645385Z cuda-nsight-12.8.55 | 113.2 MB | #9 | 19%  2025-05-07T20:26:10.8930716Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 75% 2025-05-07T20:26:10.8930997Z 2025-05-07T20:26:10.8931001Z 2025-05-07T20:26:10.8931005Z 2025-05-07T20:26:10.8931008Z 2025-05-07T20:26:10.8943245Z 2025-05-07T20:26:10.9154880Z libnpp-12.3.3.65 | 130.6 MB | ##6 | 27%  2025-05-07T20:26:10.9155224Z 2025-05-07T20:26:10.9155228Z 2025-05-07T20:26:10.9155232Z 2025-05-07T20:26:10.9155236Z 2025-05-07T20:26:10.9155239Z 2025-05-07T20:26:10.9155243Z 2025-05-07T20:26:10.9155247Z 2025-05-07T20:26:10.9231088Z cuda-nvvp-12.8.57 | 112.4 MB | ##1 | 22%  2025-05-07T20:26:10.9231384Z 2025-05-07T20:26:10.9231388Z 2025-05-07T20:26:10.9231392Z 2025-05-07T20:26:10.9231395Z 2025-05-07T20:26:10.9231399Z 2025-05-07T20:26:10.9231633Z 2025-05-07T20:26:10.9935190Z cuda-nsight-12.8.55 | 113.2 MB | ##1 | 21%  2025-05-07T20:26:10.9935492Z 2025-05-07T20:26:10.9935496Z 2025-05-07T20:26:10.9935731Z 2025-05-07T20:26:10.9935735Z 2025-05-07T20:26:10.9940362Z 2025-05-07T20:26:11.0158755Z libnpp-12.3.3.65 | 130.6 MB | ##8 | 29%  2025-05-07T20:26:11.0159039Z 2025-05-07T20:26:11.0159043Z 2025-05-07T20:26:11.0159047Z 2025-05-07T20:26:11.0159051Z 2025-05-07T20:26:11.0159055Z 2025-05-07T20:26:11.0159058Z 2025-05-07T20:26:11.0161104Z 2025-05-07T20:26:11.0235556Z cuda-nvvp-12.8.57 | 112.4 MB | ##4 | 24%  2025-05-07T20:26:11.0238683Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:26:11.0238974Z 2025-05-07T20:26:11.0238979Z 2025-05-07T20:26:11.0238982Z 2025-05-07T20:26:11.0238986Z 2025-05-07T20:26:11.0238990Z 2025-05-07T20:26:11.0238995Z 2025-05-07T20:26:11.0944610Z cuda-nsight-12.8.55 | 113.2 MB | ##3 | 24%  2025-05-07T20:26:11.0944916Z 2025-05-07T20:26:11.0944920Z 
2025-05-07T20:26:11.0944924Z 2025-05-07T20:26:11.0944927Z 2025-05-07T20:26:11.0952158Z 2025-05-07T20:26:11.1239390Z libnpp-12.3.3.65 | 130.6 MB | ### | 31%  2025-05-07T20:26:11.1258162Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:11.1258508Z 2025-05-07T20:26:11.1258513Z 2025-05-07T20:26:11.1258516Z 2025-05-07T20:26:11.1258520Z 2025-05-07T20:26:11.1258523Z 2025-05-07T20:26:11.1260751Z 2025-05-07T20:26:11.1268688Z cuda-nsight-12.8.55 | 113.2 MB | ##5 | 26%  2025-05-07T20:26:11.1269001Z 2025-05-07T20:26:11.1269005Z 2025-05-07T20:26:11.1269008Z 2025-05-07T20:26:11.1269012Z 2025-05-07T20:26:11.1269016Z 2025-05-07T20:26:11.1269019Z 2025-05-07T20:26:11.1269023Z 2025-05-07T20:26:11.2123344Z cuda-nvvp-12.8.57 | 112.4 MB | ##6 | 27%  2025-05-07T20:26:11.2123638Z 2025-05-07T20:26:11.2123642Z 2025-05-07T20:26:11.2123664Z 2025-05-07T20:26:11.2123668Z 2025-05-07T20:26:11.2125537Z 2025-05-07T20:26:11.2247621Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 33%  2025-05-07T20:26:11.2251047Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:11.2251418Z 2025-05-07T20:26:11.2251424Z 2025-05-07T20:26:11.2251429Z 2025-05-07T20:26:11.2251434Z 2025-05-07T20:26:11.2251439Z 2025-05-07T20:26:11.2251445Z 2025-05-07T20:26:11.2293302Z cuda-nsight-12.8.55 | 113.2 MB | ##7 | 28%  2025-05-07T20:26:11.2293605Z 2025-05-07T20:26:11.2293609Z 2025-05-07T20:26:11.2293612Z 2025-05-07T20:26:11.2293616Z 2025-05-07T20:26:11.2293619Z 2025-05-07T20:26:11.2293623Z 2025-05-07T20:26:11.2295407Z 2025-05-07T20:26:11.3129373Z cuda-nvvp-12.8.57 | 112.4 MB | ##8 | 29%  2025-05-07T20:26:11.3129671Z 2025-05-07T20:26:11.3129675Z 2025-05-07T20:26:11.3129679Z 2025-05-07T20:26:11.3129682Z 2025-05-07T20:26:11.3129686Z 2025-05-07T20:26:11.3251839Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:26:11.3345517Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:11.3345899Z 2025-05-07T20:26:11.3345905Z 2025-05-07T20:26:11.3345910Z 2025-05-07T20:26:11.3345915Z 2025-05-07T20:26:11.3345921Z 2025-05-07T20:26:11.3345926Z 2025-05-07T20:26:11.3413803Z cuda-nsight-12.8.55 | 113.2 MB | ##9 | 30%  2025-05-07T20:26:11.3414105Z 2025-05-07T20:26:11.3414109Z 2025-05-07T20:26:11.3414113Z 2025-05-07T20:26:11.3414116Z 2025-05-07T20:26:11.3414120Z 2025-05-07T20:26:11.3414123Z 2025-05-07T20:26:11.3420323Z 2025-05-07T20:26:11.4182279Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 31%  2025-05-07T20:26:11.4182582Z 2025-05-07T20:26:11.4182586Z 2025-05-07T20:26:11.4182589Z 2025-05-07T20:26:11.4182593Z 2025-05-07T20:26:11.4186441Z 2025-05-07T20:26:11.4311067Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:26:11.4347042Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 78% 2025-05-07T20:26:11.4347307Z 2025-05-07T20:26:11.4347311Z 2025-05-07T20:26:11.4347314Z 2025-05-07T20:26:11.4347318Z 2025-05-07T20:26:11.4347464Z 2025-05-07T20:26:11.4347468Z 2025-05-07T20:26:11.4455815Z cuda-nsight-12.8.55 | 113.2 MB | ###2 | 32%  2025-05-07T20:26:11.4456116Z 2025-05-07T20:26:11.4456120Z 2025-05-07T20:26:11.4456123Z 2025-05-07T20:26:11.4456127Z 2025-05-07T20:26:11.4456131Z 2025-05-07T20:26:11.4456135Z 2025-05-07T20:26:11.4460027Z 2025-05-07T20:26:11.5225797Z cuda-nvvp-12.8.57 | 112.4 MB | ###3 | 33%  2025-05-07T20:26:11.5226107Z 2025-05-07T20:26:11.5226111Z 2025-05-07T20:26:11.5226114Z 2025-05-07T20:26:11.5226118Z 2025-05-07T20:26:11.5226123Z 2025-05-07T20:26:11.5347827Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:26:11.5348102Z 2025-05-07T20:26:11.5348105Z 2025-05-07T20:26:11.5348109Z 2025-05-07T20:26:11.5348132Z 
2025-05-07T20:26:11.5348136Z 2025-05-07T20:26:11.5348143Z 2025-05-07T20:26:11.5365823Z cuda-nsight-12.8.55 | 113.2 MB | ###4 | 35%  2025-05-07T20:26:11.5459481Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:26:11.5459854Z 2025-05-07T20:26:11.5459860Z 2025-05-07T20:26:11.5459865Z 2025-05-07T20:26:11.5459871Z 2025-05-07T20:26:11.5459874Z 2025-05-07T20:26:11.5459878Z 2025-05-07T20:26:11.5462376Z 2025-05-07T20:26:11.6228087Z cuda-nvvp-12.8.57 | 112.4 MB | ###5 | 36%  2025-05-07T20:26:11.6228382Z 2025-05-07T20:26:11.6228386Z 2025-05-07T20:26:11.6228389Z 2025-05-07T20:26:11.6228393Z 2025-05-07T20:26:11.6229818Z 2025-05-07T20:26:11.6369255Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:26:11.6396990Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 79% 2025-05-07T20:26:11.6397278Z 2025-05-07T20:26:11.6397282Z 2025-05-07T20:26:11.6397286Z 2025-05-07T20:26:11.6397290Z 2025-05-07T20:26:11.6397317Z 2025-05-07T20:26:11.6399498Z 2025-05-07T20:26:11.7228642Z cuda-nsight-12.8.55 | 113.2 MB | ###6 | 37%  2025-05-07T20:26:11.7229028Z 2025-05-07T20:26:11.7229049Z 2025-05-07T20:26:11.7229053Z 2025-05-07T20:26:11.7229057Z 2025-05-07T20:26:11.7229064Z 2025-05-07T20:26:11.7373763Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 42%  2025-05-07T20:26:11.7411739Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:26:11.7412101Z 2025-05-07T20:26:11.7412107Z 2025-05-07T20:26:11.7412111Z 2025-05-07T20:26:11.7412115Z 2025-05-07T20:26:11.7412118Z 2025-05-07T20:26:11.7412687Z 2025-05-07T20:26:11.7425350Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 39%  2025-05-07T20:26:11.7425859Z 2025-05-07T20:26:11.7425865Z 2025-05-07T20:26:11.7425871Z 2025-05-07T20:26:11.7425876Z 2025-05-07T20:26:11.7425881Z 2025-05-07T20:26:11.7425886Z 2025-05-07T20:26:11.7429313Z 2025-05-07T20:26:11.8265193Z cuda-nvvp-12.8.57 | 112.4 MB | ###7 | 38%  2025-05-07T20:26:11.8265609Z 2025-05-07T20:26:11.8265614Z 2025-05-07T20:26:11.8265619Z 2025-05-07T20:26:11.8265625Z 2025-05-07T20:26:11.8269247Z 2025-05-07T20:26:11.8398302Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 44%  2025-05-07T20:26:11.8427255Z libcublas-12.8.3.14 | 460.2 MB | ######## | 81% 2025-05-07T20:26:11.8427628Z 2025-05-07T20:26:11.8427791Z 2025-05-07T20:26:11.8427799Z 2025-05-07T20:26:11.8427804Z 2025-05-07T20:26:11.8427809Z 2025-05-07T20:26:11.8427815Z 2025-05-07T20:26:11.8428936Z 2025-05-07T20:26:11.8565067Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 40%  2025-05-07T20:26:11.8565475Z 2025-05-07T20:26:11.8565480Z 2025-05-07T20:26:11.8565486Z 2025-05-07T20:26:11.8565502Z 2025-05-07T20:26:11.8565508Z 2025-05-07T20:26:11.8565513Z 2025-05-07T20:26:11.9267025Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 41%  2025-05-07T20:26:11.9267446Z 2025-05-07T20:26:11.9267803Z 2025-05-07T20:26:11.9267810Z 2025-05-07T20:26:11.9267814Z 2025-05-07T20:26:11.9269353Z 2025-05-07T20:26:11.9428363Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 46%  2025-05-07T20:26:11.9428942Z 2025-05-07T20:26:11.9428946Z 2025-05-07T20:26:11.9428949Z 2025-05-07T20:26:11.9428952Z 2025-05-07T20:26:11.9428956Z 2025-05-07T20:26:11.9428959Z 2025-05-07T20:26:11.9432760Z 2025-05-07T20:26:11.9446323Z cuda-nvvp-12.8.57 | 112.4 MB | ####2 | 42%  2025-05-07T20:26:11.9568245Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:26:11.9568516Z 2025-05-07T20:26:11.9568521Z 2025-05-07T20:26:11.9568527Z 2025-05-07T20:26:11.9568540Z 2025-05-07T20:26:11.9568546Z 2025-05-07T20:26:11.9570815Z 2025-05-07T20:26:12.0271458Z cuda-nsight-12.8.55 | 113.2 MB | ####3 | 44%  2025-05-07T20:26:12.0271780Z 2025-05-07T20:26:12.0271784Z 
2025-05-07T20:26:12.0271796Z 2025-05-07T20:26:12.0271800Z 2025-05-07T20:26:12.0273558Z 2025-05-07T20:26:12.0449185Z libnpp-12.3.3.65 | 130.6 MB | ####8 | 48%  2025-05-07T20:26:12.0476273Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 82% 2025-05-07T20:26:12.0476579Z 2025-05-07T20:26:12.0476584Z 2025-05-07T20:26:12.0476588Z 2025-05-07T20:26:12.0476592Z 2025-05-07T20:26:12.0476596Z 2025-05-07T20:26:12.0476599Z 2025-05-07T20:26:12.0478086Z 2025-05-07T20:26:12.0743016Z cuda-nvvp-12.8.57 | 112.4 MB | ####4 | 44%  2025-05-07T20:26:12.0743391Z 2025-05-07T20:26:12.0743395Z 2025-05-07T20:26:12.0743399Z 2025-05-07T20:26:12.0743403Z 2025-05-07T20:26:12.0743414Z 2025-05-07T20:26:12.0747105Z 2025-05-07T20:26:12.1304660Z cuda-nsight-12.8.55 | 113.2 MB | ####5 | 46%  2025-05-07T20:26:12.1305015Z 2025-05-07T20:26:12.1305019Z 2025-05-07T20:26:12.1305023Z 2025-05-07T20:26:12.1305027Z 2025-05-07T20:26:12.1309735Z 2025-05-07T20:26:12.1450436Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:26:12.1478694Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:26:12.1478959Z 2025-05-07T20:26:12.1478963Z 2025-05-07T20:26:12.1478967Z 2025-05-07T20:26:12.1478983Z 2025-05-07T20:26:12.1478987Z 2025-05-07T20:26:12.1478990Z 2025-05-07T20:26:12.1479061Z 2025-05-07T20:26:12.1786704Z cuda-nvvp-12.8.57 | 112.4 MB | ####6 | 47%  2025-05-07T20:26:12.1787015Z 2025-05-07T20:26:12.1787019Z 2025-05-07T20:26:12.1787023Z 2025-05-07T20:26:12.1787027Z 2025-05-07T20:26:12.1787031Z 2025-05-07T20:26:12.1795984Z 2025-05-07T20:26:12.2305750Z cuda-nsight-12.8.55 | 113.2 MB | ####7 | 48%  2025-05-07T20:26:12.2306165Z 2025-05-07T20:26:12.2306169Z 2025-05-07T20:26:12.2306172Z 2025-05-07T20:26:12.2306176Z 2025-05-07T20:26:12.2307884Z 2025-05-07T20:26:12.2456696Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 52%  2025-05-07T20:26:12.2480010Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:26:12.2480490Z 2025-05-07T20:26:12.2480494Z 2025-05-07T20:26:12.2480498Z 2025-05-07T20:26:12.2480502Z 2025-05-07T20:26:12.2480505Z 2025-05-07T20:26:12.2480509Z 2025-05-07T20:26:12.2480850Z 2025-05-07T20:26:12.2816241Z cuda-nvvp-12.8.57 | 112.4 MB | ####8 | 49%  2025-05-07T20:26:12.2816698Z 2025-05-07T20:26:12.2816705Z 2025-05-07T20:26:12.2816711Z 2025-05-07T20:26:12.2816717Z 2025-05-07T20:26:12.2816722Z 2025-05-07T20:26:12.2818674Z 2025-05-07T20:26:12.3475931Z cuda-nsight-12.8.55 | 113.2 MB | ##### | 50%  2025-05-07T20:26:12.3476295Z 2025-05-07T20:26:12.3476299Z 2025-05-07T20:26:12.3476303Z 2025-05-07T20:26:12.3476306Z 2025-05-07T20:26:12.3478551Z 2025-05-07T20:26:12.3481576Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 54%  2025-05-07T20:26:12.3481885Z 2025-05-07T20:26:12.3481889Z 2025-05-07T20:26:12.3481893Z 2025-05-07T20:26:12.3481897Z 2025-05-07T20:26:12.3481901Z 2025-05-07T20:26:12.3481905Z 2025-05-07T20:26:12.3483579Z 2025-05-07T20:26:12.3488779Z cuda-nvvp-12.8.57 | 112.4 MB | #####1 | 51%  2025-05-07T20:26:12.3832611Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 84% 2025-05-07T20:26:12.3833308Z 2025-05-07T20:26:12.3833313Z 2025-05-07T20:26:12.3833316Z 2025-05-07T20:26:12.3833320Z 2025-05-07T20:26:12.3833324Z 2025-05-07T20:26:12.3836138Z 2025-05-07T20:26:12.4477578Z cuda-nsight-12.8.55 | 113.2 MB | #####2 | 52%  2025-05-07T20:26:12.4477965Z 2025-05-07T20:26:12.4477969Z 2025-05-07T20:26:12.4477973Z 2025-05-07T20:26:12.4477976Z 2025-05-07T20:26:12.4477980Z 2025-05-07T20:26:12.4483749Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 56%  2025-05-07T20:26:12.4484070Z 2025-05-07T20:26:12.4484074Z 2025-05-07T20:26:12.4484078Z 
2025-05-07T20:26:12.4484081Z 2025-05-07T20:26:12.4484085Z 2025-05-07T20:26:12.4484089Z 2025-05-07T20:26:12.4486058Z 2025-05-07T20:26:12.4553168Z cuda-nvvp-12.8.57 | 112.4 MB | #####3 | 53%  2025-05-07T20:26:12.4836190Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:26:12.4836574Z 2025-05-07T20:26:12.4836679Z 2025-05-07T20:26:12.4836730Z 2025-05-07T20:26:12.4836735Z 2025-05-07T20:26:12.4836741Z 2025-05-07T20:26:12.4836750Z 2025-05-07T20:26:12.5502753Z cuda-nsight-12.8.55 | 113.2 MB | #####4 | 54%  2025-05-07T20:26:12.5503085Z 2025-05-07T20:26:12.5503090Z 2025-05-07T20:26:12.5503093Z 2025-05-07T20:26:12.5503097Z 2025-05-07T20:26:12.5510005Z 2025-05-07T20:26:12.5571818Z libnpp-12.3.3.65 | 130.6 MB | #####8 | 58%  2025-05-07T20:26:12.5575947Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 85% 2025-05-07T20:26:12.5576205Z 2025-05-07T20:26:12.5576209Z 2025-05-07T20:26:12.5576213Z 2025-05-07T20:26:12.5576216Z 2025-05-07T20:26:12.5576220Z 2025-05-07T20:26:12.5576232Z 2025-05-07T20:26:12.5577909Z 2025-05-07T20:26:12.5837520Z cuda-nvvp-12.8.57 | 112.4 MB | #####5 | 56%  2025-05-07T20:26:12.5837953Z 2025-05-07T20:26:12.5837959Z 2025-05-07T20:26:12.5837965Z 2025-05-07T20:26:12.5837970Z 2025-05-07T20:26:12.5837974Z 2025-05-07T20:26:12.5839114Z 2025-05-07T20:26:12.6510022Z cuda-nsight-12.8.55 | 113.2 MB | #####6 | 57%  2025-05-07T20:26:12.6510433Z 2025-05-07T20:26:12.6510438Z 2025-05-07T20:26:12.6510441Z 2025-05-07T20:26:12.6510445Z 2025-05-07T20:26:12.6512349Z 2025-05-07T20:26:12.6573051Z libnpp-12.3.3.65 | 130.6 MB | ###### | 60%  2025-05-07T20:26:12.6576107Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 85% 2025-05-07T20:26:12.6576441Z 2025-05-07T20:26:12.6576446Z 2025-05-07T20:26:12.6576449Z 2025-05-07T20:26:12.6576453Z 2025-05-07T20:26:12.6576457Z 2025-05-07T20:26:12.6576460Z 2025-05-07T20:26:12.6576464Z 2025-05-07T20:26:12.6919154Z cuda-nvvp-12.8.57 | 112.4 MB | #####7 | 58%  2025-05-07T20:26:12.6919576Z 2025-05-07T20:26:12.6919754Z 2025-05-07T20:26:12.6919786Z 2025-05-07T20:26:12.6919792Z 2025-05-07T20:26:12.6919797Z 2025-05-07T20:26:12.6924381Z 2025-05-07T20:26:12.7514581Z cuda-nsight-12.8.55 | 113.2 MB | #####8 | 59%  2025-05-07T20:26:12.7514957Z 2025-05-07T20:26:12.7514961Z 2025-05-07T20:26:12.7514965Z 2025-05-07T20:26:12.7514969Z 2025-05-07T20:26:12.7516434Z 2025-05-07T20:26:12.7582123Z libnpp-12.3.3.65 | 130.6 MB | ######2 | 62%  2025-05-07T20:26:12.7582531Z 2025-05-07T20:26:12.7582535Z 2025-05-07T20:26:12.7582539Z 2025-05-07T20:26:12.7582542Z 2025-05-07T20:26:12.7582546Z 2025-05-07T20:26:12.7582549Z 2025-05-07T20:26:12.7582553Z 2025-05-07T20:26:12.7620857Z cuda-nvvp-12.8.57 | 112.4 MB | ###### | 60%  2025-05-07T20:26:12.7947852Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 2025-05-07T20:26:12.7948222Z 2025-05-07T20:26:12.7948228Z 2025-05-07T20:26:12.7948244Z 2025-05-07T20:26:12.7948249Z 2025-05-07T20:26:12.7948255Z 2025-05-07T20:26:12.7949767Z 2025-05-07T20:26:12.8516184Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 61%  2025-05-07T20:26:12.8516524Z 2025-05-07T20:26:12.8516528Z 2025-05-07T20:26:12.8516531Z 2025-05-07T20:26:12.8516717Z 2025-05-07T20:26:12.8523373Z 2025-05-07T20:26:12.8593167Z libnpp-12.3.3.65 | 130.6 MB | ######4 | 64%  2025-05-07T20:26:12.8593557Z 2025-05-07T20:26:12.8593561Z 2025-05-07T20:26:12.8593565Z 2025-05-07T20:26:12.8593568Z 2025-05-07T20:26:12.8593572Z 2025-05-07T20:26:12.8593576Z 2025-05-07T20:26:12.8595700Z 2025-05-07T20:26:12.8629141Z cuda-nvvp-12.8.57 | 112.4 MB | ######2 | 63%  2025-05-07T20:26:12.8948196Z libcublas-12.8.3.14 | 460.2 MB | 
########6 | 87% 2025-05-07T20:26:12.8948588Z 2025-05-07T20:26:12.8948788Z 2025-05-07T20:26:12.8948795Z 2025-05-07T20:26:12.8948801Z 2025-05-07T20:26:12.8948806Z 2025-05-07T20:26:12.8950404Z 2025-05-07T20:26:12.9521120Z cuda-nsight-12.8.55 | 113.2 MB | ######3 | 63%  2025-05-07T20:26:12.9521460Z 2025-05-07T20:26:12.9521465Z 2025-05-07T20:26:12.9521469Z 2025-05-07T20:26:12.9521474Z 2025-05-07T20:26:12.9521486Z 2025-05-07T20:26:12.9599296Z libnpp-12.3.3.65 | 130.6 MB | ######6 | 66%  2025-05-07T20:26:12.9599599Z 2025-05-07T20:26:12.9599603Z 2025-05-07T20:26:12.9599607Z 2025-05-07T20:26:12.9599610Z 2025-05-07T20:26:12.9599621Z 2025-05-07T20:26:12.9599624Z 2025-05-07T20:26:12.9603369Z 2025-05-07T20:26:12.9632004Z cuda-nvvp-12.8.57 | 112.4 MB | ######4 | 65%  2025-05-07T20:26:12.9953643Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 87% 2025-05-07T20:26:12.9953909Z 2025-05-07T20:26:12.9953913Z 2025-05-07T20:26:12.9953916Z 2025-05-07T20:26:12.9953920Z 2025-05-07T20:26:12.9953924Z 2025-05-07T20:26:12.9956042Z 2025-05-07T20:26:13.0602377Z cuda-nsight-12.8.55 | 113.2 MB | ######5 | 66%  2025-05-07T20:26:13.0602695Z 2025-05-07T20:26:13.0602700Z 2025-05-07T20:26:13.0602703Z 2025-05-07T20:26:13.0602736Z 2025-05-07T20:26:13.0602740Z 2025-05-07T20:26:13.0602743Z 2025-05-07T20:26:13.0603445Z 2025-05-07T20:26:13.0627313Z cuda-nvvp-12.8.57 | 112.4 MB | ######7 | 67%  2025-05-07T20:26:13.0627625Z 2025-05-07T20:26:13.0627629Z 2025-05-07T20:26:13.0627633Z 2025-05-07T20:26:13.0627644Z 2025-05-07T20:26:13.0627648Z 2025-05-07T20:26:13.0650542Z libnpp-12.3.3.65 | 130.6 MB | ######8 | 68%  2025-05-07T20:26:13.1049241Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:26:13.1049574Z 2025-05-07T20:26:13.1049637Z 2025-05-07T20:26:13.1049643Z 2025-05-07T20:26:13.1049648Z 2025-05-07T20:26:13.1049654Z 2025-05-07T20:26:13.1056919Z 2025-05-07T20:26:13.1603345Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 68%  2025-05-07T20:26:13.1603862Z 2025-05-07T20:26:13.1603871Z 2025-05-07T20:26:13.1603877Z 2025-05-07T20:26:13.1603884Z 2025-05-07T20:26:13.1603894Z 2025-05-07T20:26:13.1603902Z 2025-05-07T20:26:13.1604522Z 2025-05-07T20:26:13.1630115Z cuda-nvvp-12.8.57 | 112.4 MB | ######9 | 70%  2025-05-07T20:26:13.1630423Z 2025-05-07T20:26:13.1630427Z 2025-05-07T20:26:13.1630446Z 2025-05-07T20:26:13.1630449Z 2025-05-07T20:26:13.1632597Z 2025-05-07T20:26:13.1652138Z libnpp-12.3.3.65 | 130.6 MB | ####### | 70%  2025-05-07T20:26:13.2057034Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:26:13.2057419Z 2025-05-07T20:26:13.2057425Z 2025-05-07T20:26:13.2057430Z 2025-05-07T20:26:13.2057435Z 2025-05-07T20:26:13.2057440Z 2025-05-07T20:26:13.2059982Z 2025-05-07T20:26:13.2656003Z cuda-nsight-12.8.55 | 113.2 MB | ####### | 70%  2025-05-07T20:26:13.2656332Z 2025-05-07T20:26:13.2656336Z 2025-05-07T20:26:13.2656340Z 2025-05-07T20:26:13.2656343Z 2025-05-07T20:26:13.2656348Z 2025-05-07T20:26:13.2656352Z 2025-05-07T20:26:13.2656359Z 2025-05-07T20:26:13.2686901Z cuda-nvvp-12.8.57 | 112.4 MB | #######1 | 72%  2025-05-07T20:26:13.2687213Z 2025-05-07T20:26:13.2687217Z 2025-05-07T20:26:13.2687221Z 2025-05-07T20:26:13.2687225Z 2025-05-07T20:26:13.2688515Z 2025-05-07T20:26:13.2694772Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 72%  2025-05-07T20:26:13.3067432Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:26:13.3067763Z 2025-05-07T20:26:13.3067778Z 2025-05-07T20:26:13.3067784Z 2025-05-07T20:26:13.3067789Z 2025-05-07T20:26:13.3067794Z 2025-05-07T20:26:13.3069345Z 2025-05-07T20:26:13.3659030Z cuda-nsight-12.8.55 | 113.2 
MB | #######2 | 73%  2025-05-07T20:26:13.3659368Z 2025-05-07T20:26:13.3659372Z 2025-05-07T20:26:13.3659376Z 2025-05-07T20:26:13.3659380Z 2025-05-07T20:26:13.3659384Z 2025-05-07T20:26:13.3659388Z 2025-05-07T20:26:13.3659392Z 2025-05-07T20:26:13.3686814Z cuda-nvvp-12.8.57 | 112.4 MB | #######4 | 74%  2025-05-07T20:26:13.3687133Z 2025-05-07T20:26:13.3687139Z 2025-05-07T20:26:13.3687172Z 2025-05-07T20:26:13.3687178Z 2025-05-07T20:26:13.3689302Z 2025-05-07T20:26:13.3697817Z libnpp-12.3.3.65 | 130.6 MB | #######4 | 74%  2025-05-07T20:26:13.4070921Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 90% 2025-05-07T20:26:13.4071195Z 2025-05-07T20:26:13.4071199Z 2025-05-07T20:26:13.4071212Z 2025-05-07T20:26:13.4071216Z 2025-05-07T20:26:13.4071219Z 2025-05-07T20:26:13.4074384Z 2025-05-07T20:26:13.4665538Z cuda-nsight-12.8.55 | 113.2 MB | #######4 | 75%  2025-05-07T20:26:13.4665874Z 2025-05-07T20:26:13.4665878Z 2025-05-07T20:26:13.4665889Z 2025-05-07T20:26:13.4665893Z 2025-05-07T20:26:13.4665897Z 2025-05-07T20:26:13.4665901Z 2025-05-07T20:26:13.4665905Z 2025-05-07T20:26:13.4731524Z cuda-nvvp-12.8.57 | 112.4 MB | #######6 | 77%  2025-05-07T20:26:13.4801142Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:26:13.4801416Z 2025-05-07T20:26:13.4801420Z 2025-05-07T20:26:13.4801425Z 2025-05-07T20:26:13.4801454Z 2025-05-07T20:26:13.4806375Z 2025-05-07T20:26:13.5106447Z libnpp-12.3.3.65 | 130.6 MB | #######6 | 76%  2025-05-07T20:26:13.5106756Z 2025-05-07T20:26:13.5106784Z 2025-05-07T20:26:13.5106788Z 2025-05-07T20:26:13.5106791Z 2025-05-07T20:26:13.5106795Z 2025-05-07T20:26:13.5107560Z 2025-05-07T20:26:13.5670019Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 77%  2025-05-07T20:26:13.5670616Z 2025-05-07T20:26:13.5670620Z 2025-05-07T20:26:13.5670625Z 2025-05-07T20:26:13.5670629Z 2025-05-07T20:26:13.5670634Z 2025-05-07T20:26:13.5670638Z 2025-05-07T20:26:13.5670643Z 2025-05-07T20:26:13.5737940Z cuda-nvvp-12.8.57 | 112.4 MB | #######9 | 79%  2025-05-07T20:26:13.5898301Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 91% 2025-05-07T20:26:13.5898667Z 2025-05-07T20:26:13.5898680Z 2025-05-07T20:26:13.5898684Z 2025-05-07T20:26:13.5898688Z 2025-05-07T20:26:13.5910216Z 2025-05-07T20:26:13.6115875Z libnpp-12.3.3.65 | 130.6 MB | #######8 | 78%  2025-05-07T20:26:13.6116257Z 2025-05-07T20:26:13.6116263Z 2025-05-07T20:26:13.6116269Z 2025-05-07T20:26:13.6116274Z 2025-05-07T20:26:13.6116292Z 2025-05-07T20:26:13.6116298Z 2025-05-07T20:26:13.6693331Z cuda-nsight-12.8.55 | 113.2 MB | #######9 | 79%  2025-05-07T20:26:13.6693666Z 2025-05-07T20:26:13.6693670Z 2025-05-07T20:26:13.6693673Z 2025-05-07T20:26:13.6693677Z 2025-05-07T20:26:13.6693681Z 2025-05-07T20:26:13.6693685Z 2025-05-07T20:26:13.6698071Z 2025-05-07T20:26:13.6911922Z cuda-nvvp-12.8.57 | 112.4 MB | ########1 | 82%  2025-05-07T20:26:13.6953949Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:26:13.6954217Z 2025-05-07T20:26:13.6954221Z 2025-05-07T20:26:13.6954225Z 2025-05-07T20:26:13.6954228Z 2025-05-07T20:26:13.6954260Z 2025-05-07T20:26:13.7120308Z libnpp-12.3.3.65 | 130.6 MB | ######## | 80%  2025-05-07T20:26:13.7120614Z 2025-05-07T20:26:13.7120864Z 2025-05-07T20:26:13.7120869Z 2025-05-07T20:26:13.7120873Z 2025-05-07T20:26:13.7120885Z 2025-05-07T20:26:13.7129221Z 2025-05-07T20:26:13.7719158Z cuda-nsight-12.8.55 | 113.2 MB | ########1 | 82%  2025-05-07T20:26:13.7719750Z 2025-05-07T20:26:13.7719756Z 2025-05-07T20:26:13.7719771Z 2025-05-07T20:26:13.7719775Z 2025-05-07T20:26:13.7719779Z 2025-05-07T20:26:13.7719782Z 2025-05-07T20:26:13.7719786Z 
2025-05-07T20:26:17.9631202Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:26:18.0112938Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:26:18.1598253Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:26:19.1630981Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:26:21.2924211Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:26:21.6349029Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:26:21.6451030Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:26:21.6807856Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:26:21.7811423Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:26:22.4968872Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:26:23.4758503Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:26:23.4992805Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:23.7985284Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:26:24.1203502Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:26:24.4070069Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:26:24.4203880Z ... (more hidden) ...
2025-05-07T20:26:24.4677946Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:26:24.5072266Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:26:25.1247198Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:26:26.6491463Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:35.7878653Z 2025-05-07T20:26:35.7878926Z  2025-05-07T20:26:35.7879290Z 2025-05-07T20:26:35.7879296Z 2025-05-07T20:26:35.7879301Z 2025-05-07T20:26:35.7879306Z 2025-05-07T20:26:35.7879312Z 2025-05-07T20:26:35.7879317Z 2025-05-07T20:26:35.7879322Z 2025-05-07T20:26:35.7879328Z 2025-05-07T20:26:35.7879341Z 2025-05-07T20:26:35.7879346Z 2025-05-07T20:26:35.7879352Z 2025-05-07T20:26:35.7879357Z 2025-05-07T20:26:35.7879363Z 2025-05-07T20:26:35.7879369Z 2025-05-07T20:26:35.7879374Z 2025-05-07T20:26:35.7879380Z 2025-05-07T20:26:35.7879393Z 2025-05-07T20:26:35.7879724Z  2025-05-07T20:26:35.7880078Z 2025-05-07T20:26:35.7880084Z 2025-05-07T20:26:35.7880097Z 2025-05-07T20:26:35.7880102Z 2025-05-07T20:26:35.7880107Z 2025-05-07T20:26:35.7880112Z 2025-05-07T20:26:35.7880250Z 2025-05-07T20:26:35.7880256Z 2025-05-07T20:26:35.7880261Z 2025-05-07T20:26:35.7880278Z 2025-05-07T20:26:35.7880284Z 2025-05-07T20:26:35.7880290Z 2025-05-07T20:26:35.7880296Z 2025-05-07T20:26:35.7880301Z 2025-05-07T20:26:35.7880307Z 2025-05-07T20:26:35.7880313Z 2025-05-07T20:26:35.7880318Z 2025-05-07T20:26:35.7880324Z 2025-05-07T20:26:35.7881518Z  2025-05-07T20:26:35.7881869Z 2025-05-07T20:26:35.7881881Z 2025-05-07T20:26:35.7882056Z  2025-05-07T20:26:35.7882231Z 2025-05-07T20:26:35.7882241Z 2025-05-07T20:26:35.7882929Z  2025-05-07T20:26:35.7883112Z 2025-05-07T20:26:35.7883118Z 2025-05-07T20:26:35.7883128Z 2025-05-07T20:26:35.7884334Z  2025-05-07T20:26:35.7884515Z 2025-05-07T20:26:35.7884539Z 2025-05-07T20:26:35.7884544Z 2025-05-07T20:26:35.7884549Z 2025-05-07T20:26:35.7884710Z  2025-05-07T20:26:35.7885016Z 2025-05-07T20:26:35.7885022Z 2025-05-07T20:26:35.7885026Z 2025-05-07T20:26:35.7885039Z 2025-05-07T20:26:35.7885044Z 2025-05-07T20:26:35.7885209Z  2025-05-07T20:26:35.7885384Z 2025-05-07T20:26:35.7885390Z 2025-05-07T20:26:35.7885395Z 2025-05-07T20:26:35.7885400Z 2025-05-07T20:26:35.7885405Z 2025-05-07T20:26:35.7885420Z 2025-05-07T20:26:35.7885588Z  2025-05-07T20:26:35.7885771Z 2025-05-07T20:26:35.7885776Z 2025-05-07T20:26:35.7885781Z 2025-05-07T20:26:35.7885787Z 2025-05-07T20:26:35.7885792Z 2025-05-07T20:26:35.7885804Z 2025-05-07T20:26:35.7885810Z 2025-05-07T20:26:35.7886226Z  2025-05-07T20:26:35.7886385Z 2025-05-07T20:26:35.7886389Z 2025-05-07T20:26:35.7886393Z 2025-05-07T20:26:35.7886396Z 2025-05-07T20:26:35.7886413Z 2025-05-07T20:26:35.7886429Z 2025-05-07T20:26:35.7886432Z 2025-05-07T20:26:35.7886439Z 2025-05-07T20:26:35.7886764Z  2025-05-07T20:26:35.7886975Z 2025-05-07T20:26:35.7886986Z 2025-05-07T20:26:35.7886997Z 2025-05-07T20:26:35.7887005Z 2025-05-07T20:26:35.7887009Z 2025-05-07T20:26:35.7887013Z 2025-05-07T20:26:35.7887016Z 2025-05-07T20:26:35.7887020Z 2025-05-07T20:26:35.7887023Z 2025-05-07T20:26:35.7887552Z  2025-05-07T20:26:35.7887827Z 2025-05-07T20:26:35.7887832Z 2025-05-07T20:26:35.7887838Z 2025-05-07T20:26:35.7887843Z 2025-05-07T20:26:35.7887848Z 2025-05-07T20:26:35.7887853Z 2025-05-07T20:26:35.7887858Z 2025-05-07T20:26:35.7887863Z 2025-05-07T20:26:35.7887869Z 2025-05-07T20:26:35.7887888Z 2025-05-07T20:26:35.7888340Z  2025-05-07T20:26:35.7888593Z 2025-05-07T20:26:35.7888599Z 2025-05-07T20:26:35.7888605Z 2025-05-07T20:26:35.7888610Z 2025-05-07T20:26:35.7888615Z 2025-05-07T20:26:35.7888628Z 2025-05-07T20:26:35.7888644Z 2025-05-07T20:26:35.7888649Z 2025-05-07T20:26:35.7888654Z 2025-05-07T20:26:35.7888660Z 2025-05-07T20:26:35.7888665Z 2025-05-07T20:26:35.7888926Z  2025-05-07T20:26:35.7889196Z 2025-05-07T20:26:35.7889201Z 2025-05-07T20:26:35.7889207Z 2025-05-07T20:26:35.7889212Z 2025-05-07T20:26:35.7889234Z 
2025-05-07T20:26:35.7889239Z 2025-05-07T20:26:35.7889245Z 2025-05-07T20:26:35.7889250Z 2025-05-07T20:26:35.7889255Z 2025-05-07T20:26:35.7889260Z 2025-05-07T20:26:35.7889265Z 2025-05-07T20:26:35.7889271Z 2025-05-07T20:26:35.7889511Z  2025-05-07T20:26:35.7889786Z 2025-05-07T20:26:35.7889799Z 2025-05-07T20:26:35.7889805Z 2025-05-07T20:26:35.7889810Z 2025-05-07T20:26:35.7889815Z 2025-05-07T20:26:35.7889826Z 2025-05-07T20:26:35.7889831Z 2025-05-07T20:26:35.7889837Z 2025-05-07T20:26:35.7889842Z 2025-05-07T20:26:35.7889847Z 2025-05-07T20:26:35.7889852Z 2025-05-07T20:26:35.7889857Z 2025-05-07T20:26:35.7889862Z 2025-05-07T20:26:35.7890270Z  2025-05-07T20:26:35.7890586Z 2025-05-07T20:26:35.7890593Z 2025-05-07T20:26:35.7890599Z 2025-05-07T20:26:35.7890604Z 2025-05-07T20:26:35.7890629Z 2025-05-07T20:26:35.7890645Z 2025-05-07T20:26:35.7890651Z 2025-05-07T20:26:35.7890656Z 2025-05-07T20:26:35.7890662Z 2025-05-07T20:26:35.7890667Z 2025-05-07T20:26:35.7890673Z 2025-05-07T20:26:35.7890678Z 2025-05-07T20:26:35.7890684Z 2025-05-07T20:26:35.7890690Z 2025-05-07T20:26:35.7890943Z  2025-05-07T20:26:35.7891270Z 2025-05-07T20:26:35.7891276Z 2025-05-07T20:26:35.7891282Z 2025-05-07T20:26:35.7891288Z 2025-05-07T20:26:35.7891294Z 2025-05-07T20:26:35.7891299Z 2025-05-07T20:26:35.7891313Z 2025-05-07T20:26:35.7891319Z 2025-05-07T20:26:35.7891325Z 2025-05-07T20:26:35.7891331Z 2025-05-07T20:26:35.7891337Z 2025-05-07T20:26:35.7891342Z 2025-05-07T20:26:35.7891348Z 2025-05-07T20:26:35.7891353Z 2025-05-07T20:26:35.7891359Z 2025-05-07T20:26:35.7891781Z  2025-05-07T20:26:35.7892108Z 2025-05-07T20:26:35.7892114Z 2025-05-07T20:26:35.7892120Z 2025-05-07T20:26:35.7892127Z 2025-05-07T20:26:35.7892238Z 2025-05-07T20:26:35.7892244Z 2025-05-07T20:26:35.7892250Z 2025-05-07T20:26:35.7892256Z 2025-05-07T20:26:35.7892262Z 2025-05-07T20:26:35.7892278Z 2025-05-07T20:26:35.7892284Z 2025-05-07T20:26:35.7892290Z 2025-05-07T20:26:35.7892296Z 2025-05-07T20:26:35.7892302Z 2025-05-07T20:26:35.7892308Z 2025-05-07T20:26:35.7892326Z 2025-05-07T20:26:35.7892589Z  2025-05-07T20:26:35.7892936Z 2025-05-07T20:26:35.7892942Z 2025-05-07T20:26:35.7892948Z 2025-05-07T20:26:35.7892954Z 2025-05-07T20:26:35.7892960Z 2025-05-07T20:26:35.7892966Z 2025-05-07T20:26:35.7892972Z 2025-05-07T20:26:35.7892978Z 2025-05-07T20:26:35.7892984Z 2025-05-07T20:26:35.7892990Z 2025-05-07T20:26:35.7892995Z 2025-05-07T20:26:35.7893001Z 2025-05-07T20:26:35.7893007Z 2025-05-07T20:26:35.7893022Z 2025-05-07T20:26:35.7893028Z 2025-05-07T20:26:35.7893034Z 2025-05-07T20:26:35.7893040Z 2025-05-07T20:26:35.7893321Z  2025-05-07T20:26:35.7893689Z 2025-05-07T20:26:35.7893695Z 2025-05-07T20:26:35.7893701Z 2025-05-07T20:26:35.7893707Z 2025-05-07T20:26:35.7893713Z 2025-05-07T20:26:35.7893718Z 2025-05-07T20:26:35.7893724Z 2025-05-07T20:26:35.7893729Z 2025-05-07T20:26:35.7893735Z 2025-05-07T20:26:35.7893749Z 2025-05-07T20:26:35.7893756Z 2025-05-07T20:26:35.7893761Z 2025-05-07T20:26:35.7893767Z 2025-05-07T20:26:35.7893773Z 2025-05-07T20:26:35.7893778Z 2025-05-07T20:26:35.7893784Z 2025-05-07T20:26:35.7893790Z 2025-05-07T20:26:35.7893796Z 2025-05-07T20:26:35.7894499Z  2025-05-07T20:26:35.7894855Z 2025-05-07T20:26:35.7894866Z 2025-05-07T20:26:35.7895036Z  2025-05-07T20:26:35.7895205Z 2025-05-07T20:26:35.7895211Z 2025-05-07T20:26:35.7895624Z  2025-05-07T20:26:35.7895802Z 2025-05-07T20:26:35.7895816Z 2025-05-07T20:26:35.7895825Z 2025-05-07T20:26:35.7896205Z  2025-05-07T20:26:35.7896380Z 2025-05-07T20:26:35.7896386Z 2025-05-07T20:26:35.7896391Z 2025-05-07T20:26:35.7896412Z 2025-05-07T20:26:35.7896734Z  
2025-05-07T20:26:35.7896908Z 2025-05-07T20:26:35.7896921Z 2025-05-07T20:26:35.7896927Z 2025-05-07T20:26:35.7896932Z 2025-05-07T20:26:35.7896941Z 2025-05-07T20:26:35.7897396Z  2025-05-07T20:26:35.7897611Z 2025-05-07T20:26:35.7897617Z 2025-05-07T20:26:35.7897623Z 2025-05-07T20:26:35.7897629Z 2025-05-07T20:26:35.7897635Z 2025-05-07T20:26:35.7897640Z 2025-05-07T20:26:35.7898014Z  2025-05-07T20:26:35.7898231Z 2025-05-07T20:26:35.7898237Z 2025-05-07T20:26:35.7898243Z 2025-05-07T20:26:35.7898249Z 2025-05-07T20:26:35.7898255Z 2025-05-07T20:26:35.7898261Z 2025-05-07T20:26:35.7898271Z 2025-05-07T20:26:35.7898667Z  2025-05-07T20:26:35.7898906Z 2025-05-07T20:26:35.7898912Z 2025-05-07T20:26:35.7898927Z 2025-05-07T20:26:35.7898933Z 2025-05-07T20:26:35.7898939Z 2025-05-07T20:26:35.7898944Z 2025-05-07T20:26:35.7898964Z 2025-05-07T20:26:35.7898969Z 2025-05-07T20:26:35.7899355Z  2025-05-07T20:26:35.7899597Z 2025-05-07T20:26:35.7899602Z 2025-05-07T20:26:35.7899607Z 2025-05-07T20:26:35.7899612Z 2025-05-07T20:26:35.7899618Z 2025-05-07T20:26:35.7899623Z 2025-05-07T20:26:35.7899628Z 2025-05-07T20:26:35.7899633Z 2025-05-07T20:26:35.7899641Z 2025-05-07T20:26:35.7899908Z  2025-05-07T20:26:35.7900138Z 2025-05-07T20:26:35.7900143Z 2025-05-07T20:26:35.7900148Z 2025-05-07T20:26:35.7900153Z 2025-05-07T20:26:35.7900158Z 2025-05-07T20:26:35.7900169Z 2025-05-07T20:26:35.7900174Z 2025-05-07T20:26:35.7900179Z 2025-05-07T20:26:35.7900184Z 2025-05-07T20:26:35.7900189Z 2025-05-07T20:26:35.7900532Z  2025-05-07T20:26:35.7900778Z 2025-05-07T20:26:35.7900792Z 2025-05-07T20:26:35.7900797Z 2025-05-07T20:26:35.7900803Z 2025-05-07T20:26:35.7900939Z 2025-05-07T20:26:35.7900946Z 2025-05-07T20:26:35.7900952Z 2025-05-07T20:26:35.7900957Z 2025-05-07T20:26:35.7900962Z 2025-05-07T20:26:35.7900967Z 2025-05-07T20:26:35.7901066Z 2025-05-07T20:26:35.7901270Z  2025-05-07T20:26:35.7901535Z 2025-05-07T20:26:35.7901541Z 2025-05-07T20:26:35.7901546Z 2025-05-07T20:26:35.7901551Z 2025-05-07T20:26:35.7901557Z 2025-05-07T20:26:35.7901562Z 2025-05-07T20:26:35.7901567Z 2025-05-07T20:26:35.7901572Z 2025-05-07T20:26:35.7901576Z 2025-05-07T20:26:35.7901581Z 2025-05-07T20:26:35.7901586Z 2025-05-07T20:26:35.7901591Z 2025-05-07T20:26:35.7901794Z  2025-05-07T20:26:35.7902063Z 2025-05-07T20:26:35.7902069Z 2025-05-07T20:26:35.7902074Z 2025-05-07T20:26:35.7902079Z 2025-05-07T20:26:35.7902084Z 2025-05-07T20:26:35.7902089Z 2025-05-07T20:26:35.7902094Z 2025-05-07T20:26:35.7902100Z 2025-05-07T20:26:35.7902105Z 2025-05-07T20:26:35.7902111Z 2025-05-07T20:26:35.7902116Z 2025-05-07T20:26:35.7902129Z 2025-05-07T20:26:35.7902134Z 2025-05-07T20:26:35.7902343Z  2025-05-07T20:26:35.7902616Z 2025-05-07T20:26:35.7902621Z 2025-05-07T20:26:35.7902634Z 2025-05-07T20:26:35.7902639Z 2025-05-07T20:26:35.7902644Z 2025-05-07T20:26:35.7902650Z 2025-05-07T20:26:35.7902655Z 2025-05-07T20:26:35.7902669Z 2025-05-07T20:26:35.7902674Z 2025-05-07T20:26:35.7902678Z 2025-05-07T20:26:35.7902683Z 2025-05-07T20:26:35.7902688Z 2025-05-07T20:26:35.7902694Z 2025-05-07T20:26:35.7902699Z 2025-05-07T20:26:35.7902908Z  2025-05-07T20:26:35.7903204Z 2025-05-07T20:26:35.7903209Z 2025-05-07T20:26:35.7903214Z 2025-05-07T20:26:35.7903219Z 2025-05-07T20:26:35.7903224Z 2025-05-07T20:26:35.7903228Z 2025-05-07T20:26:35.7903233Z 2025-05-07T20:26:35.7903238Z 2025-05-07T20:26:35.7903243Z 2025-05-07T20:26:35.7903248Z 2025-05-07T20:26:35.7903254Z 2025-05-07T20:26:35.7903259Z 2025-05-07T20:26:35.7903264Z 2025-05-07T20:26:35.7903275Z 2025-05-07T20:26:35.7903280Z 2025-05-07T20:26:35.7903504Z  2025-05-07T20:26:35.7903794Z 
2025-05-07T20:26:35.7903799Z 2025-05-07T20:26:35.7903811Z 2025-05-07T20:26:35.7903816Z 2025-05-07T20:26:35.7903821Z 2025-05-07T20:26:35.7903826Z 2025-05-07T20:26:35.7903831Z 2025-05-07T20:26:35.7903836Z 2025-05-07T20:26:35.7903842Z 2025-05-07T20:26:35.7903847Z 2025-05-07T20:26:35.7903852Z 2025-05-07T20:26:35.7903857Z 2025-05-07T20:26:35.7903870Z 2025-05-07T20:26:35.7903875Z 2025-05-07T20:26:35.7903880Z 2025-05-07T20:26:35.7903885Z 2025-05-07T20:26:35.7904107Z  2025-05-07T20:26:35.7904409Z 2025-05-07T20:26:35.7904423Z 2025-05-07T20:26:35.7904428Z 2025-05-07T20:26:35.7904433Z 2025-05-07T20:26:35.7904438Z 2025-05-07T20:26:35.7904443Z 2025-05-07T20:26:35.7904448Z 2025-05-07T20:26:35.7904454Z 2025-05-07T20:26:35.7904459Z 2025-05-07T20:26:35.7904464Z 2025-05-07T20:26:35.7904469Z 2025-05-07T20:26:35.7904480Z 2025-05-07T20:26:35.7904485Z 2025-05-07T20:26:35.7904490Z 2025-05-07T20:26:35.7904495Z 2025-05-07T20:26:35.7904500Z 2025-05-07T20:26:35.7904505Z 2025-05-07T20:26:35.7904748Z  2025-05-07T20:26:35.7905049Z 2025-05-07T20:26:35.7905054Z 2025-05-07T20:26:35.7905059Z 2025-05-07T20:26:35.7905065Z 2025-05-07T20:26:35.7905069Z 2025-05-07T20:26:35.7905074Z 2025-05-07T20:26:35.7905080Z 2025-05-07T20:26:35.7905085Z 2025-05-07T20:26:35.7905090Z 2025-05-07T20:26:35.7905095Z 2025-05-07T20:26:35.7905100Z 2025-05-07T20:26:35.7905105Z 2025-05-07T20:26:35.7905118Z 2025-05-07T20:26:35.7905123Z 2025-05-07T20:26:35.7905128Z 2025-05-07T20:26:35.7905133Z 2025-05-07T20:26:35.7905138Z 2025-05-07T20:26:35.7905143Z 2025-05-07T20:26:35.7905506Z  2025-05-07T20:26:35.7905823Z 2025-05-07T20:26:35.7905847Z 2025-05-07T20:26:35.7906000Z  2025-05-07T20:26:35.7906140Z 2025-05-07T20:26:35.7906144Z 2025-05-07T20:26:35.7906663Z  2025-05-07T20:26:35.7906788Z 2025-05-07T20:26:35.7906792Z 2025-05-07T20:26:35.7906802Z 2025-05-07T20:26:35.7906956Z  2025-05-07T20:26:35.7907185Z 2025-05-07T20:26:35.7907189Z 2025-05-07T20:26:35.7907192Z 2025-05-07T20:26:35.7907199Z 2025-05-07T20:26:35.7907488Z  2025-05-07T20:26:35.7907668Z 2025-05-07T20:26:35.7907677Z 2025-05-07T20:26:35.7907683Z 2025-05-07T20:26:35.7907688Z 2025-05-07T20:26:35.7907693Z 2025-05-07T20:26:35.7908054Z  2025-05-07T20:26:35.7908231Z 2025-05-07T20:26:35.7908237Z 2025-05-07T20:26:35.7908246Z 2025-05-07T20:26:35.7908251Z 2025-05-07T20:26:35.7908256Z 2025-05-07T20:26:35.7908262Z 2025-05-07T20:26:35.7908663Z  2025-05-07T20:26:35.7908859Z 2025-05-07T20:26:35.7908864Z 2025-05-07T20:26:35.7908870Z 2025-05-07T20:26:35.7908875Z 2025-05-07T20:26:35.7908880Z 2025-05-07T20:26:35.7908886Z 2025-05-07T20:26:35.7908902Z 2025-05-07T20:26:35.7909075Z  2025-05-07T20:26:35.7909290Z 2025-05-07T20:26:35.7909296Z 2025-05-07T20:26:35.7909301Z 2025-05-07T20:26:35.7909306Z 2025-05-07T20:26:35.7909315Z 2025-05-07T20:26:35.7909320Z 2025-05-07T20:26:35.7909334Z 2025-05-07T20:26:35.7909339Z 2025-05-07T20:26:35.7909655Z  2025-05-07T20:26:35.7909874Z 2025-05-07T20:26:35.7909879Z 2025-05-07T20:26:35.7909885Z 2025-05-07T20:26:35.7909890Z 2025-05-07T20:26:35.7909895Z 2025-05-07T20:26:35.7909901Z 2025-05-07T20:26:35.7909906Z 2025-05-07T20:26:35.7909911Z 2025-05-07T20:26:35.7909932Z 2025-05-07T20:26:35.7910104Z  2025-05-07T20:26:35.7910327Z 2025-05-07T20:26:35.7910333Z 2025-05-07T20:26:35.7910338Z 2025-05-07T20:26:35.7910343Z 2025-05-07T20:26:35.7910348Z 2025-05-07T20:26:35.7910353Z 2025-05-07T20:26:35.7910359Z 2025-05-07T20:26:35.7910371Z 2025-05-07T20:26:35.7910384Z 2025-05-07T20:26:35.7910390Z 2025-05-07T20:26:35.7910574Z  2025-05-07T20:26:35.7910809Z 2025-05-07T20:26:35.7910822Z 2025-05-07T20:26:35.7910828Z 
2025-05-07T20:26:35.7910840Z 2025-05-07T20:26:35.7910845Z 2025-05-07T20:26:35.7910851Z 2025-05-07T20:26:35.7910856Z 2025-05-07T20:26:35.7910861Z 2025-05-07T20:26:35.7910873Z 2025-05-07T20:26:35.7910877Z 2025-05-07T20:26:35.7910888Z 2025-05-07T20:26:35.7911072Z  2025-05-07T20:26:35.7911330Z 2025-05-07T20:26:35.7911335Z 2025-05-07T20:26:35.7911340Z 2025-05-07T20:26:35.7911345Z 2025-05-07T20:26:35.7911350Z 2025-05-07T20:26:35.7911355Z 2025-05-07T20:26:35.7911360Z 2025-05-07T20:26:35.7911364Z 2025-05-07T20:26:35.7911369Z 2025-05-07T20:26:35.7911374Z 2025-05-07T20:26:35.7911386Z 2025-05-07T20:26:35.7911391Z 2025-05-07T20:26:35.7911575Z  2025-05-07T20:26:35.7911841Z 2025-05-07T20:26:35.7911846Z 2025-05-07T20:26:35.7911851Z 2025-05-07T20:26:35.7911855Z 2025-05-07T20:26:35.7911860Z 2025-05-07T20:26:35.7911865Z 2025-05-07T20:26:35.7911870Z 2025-05-07T20:26:35.7911876Z 2025-05-07T20:26:35.7911887Z 2025-05-07T20:26:35.7911892Z 2025-05-07T20:26:35.7911897Z 2025-05-07T20:26:35.7911903Z 2025-05-07T20:26:35.7911908Z 2025-05-07T20:26:35.7912110Z  2025-05-07T20:26:35.7912389Z 2025-05-07T20:26:35.7912395Z 2025-05-07T20:26:35.7912400Z 2025-05-07T20:26:35.7912405Z 2025-05-07T20:26:35.7912410Z 2025-05-07T20:26:35.7912416Z 2025-05-07T20:26:35.7912421Z 2025-05-07T20:26:35.7912426Z 2025-05-07T20:26:35.7912440Z 2025-05-07T20:26:35.7912446Z 2025-05-07T20:26:35.7912450Z 2025-05-07T20:26:35.7912455Z 2025-05-07T20:26:35.7912460Z 2025-05-07T20:26:35.7912473Z 2025-05-07T20:26:35.7912676Z  2025-05-07T20:26:35.7912964Z 2025-05-07T20:26:35.7912970Z 2025-05-07T20:26:35.7912974Z 2025-05-07T20:26:35.7912979Z 2025-05-07T20:26:35.7912984Z 2025-05-07T20:26:35.7912989Z 2025-05-07T20:26:35.7912994Z 2025-05-07T20:26:35.7912999Z 2025-05-07T20:26:35.7913004Z 2025-05-07T20:26:35.7913009Z 2025-05-07T20:26:35.7913014Z 2025-05-07T20:26:35.7913137Z 2025-05-07T20:26:35.7913143Z 2025-05-07T20:26:35.7913148Z 2025-05-07T20:26:35.7913153Z 2025-05-07T20:26:35.7913686Z  2025-05-07T20:26:35.7914168Z 2025-05-07T20:26:35.7914174Z 2025-05-07T20:26:35.7914179Z 2025-05-07T20:26:35.7914184Z 2025-05-07T20:26:35.7914190Z 2025-05-07T20:26:35.7914195Z 2025-05-07T20:26:35.7914200Z 2025-05-07T20:26:35.7914205Z 2025-05-07T20:26:35.7914210Z 2025-05-07T20:26:35.7914216Z 2025-05-07T20:26:35.7914221Z 2025-05-07T20:26:35.7914226Z 2025-05-07T20:26:35.7914232Z 2025-05-07T20:26:35.7914249Z 2025-05-07T20:26:35.7914254Z 2025-05-07T20:26:35.7914260Z 2025-05-07T20:26:35.7914506Z  2025-05-07T20:26:35.7914813Z 2025-05-07T20:26:35.7914818Z 2025-05-07T20:26:35.7914823Z 2025-05-07T20:26:35.7914829Z 2025-05-07T20:26:35.7914834Z 2025-05-07T20:26:35.7914839Z 2025-05-07T20:26:35.7914844Z 2025-05-07T20:26:35.7914849Z 2025-05-07T20:26:35.7914855Z 2025-05-07T20:26:35.7914869Z 2025-05-07T20:26:35.7914874Z 2025-05-07T20:26:35.7914878Z 2025-05-07T20:26:35.7914884Z 2025-05-07T20:26:35.7914889Z 2025-05-07T20:26:35.7914894Z 2025-05-07T20:26:35.7914906Z 2025-05-07T20:26:35.7914911Z 2025-05-07T20:26:35.7915136Z  2025-05-07T20:26:35.7915433Z 2025-05-07T20:26:35.7915438Z 2025-05-07T20:26:35.7915443Z 2025-05-07T20:26:35.7915449Z 2025-05-07T20:26:35.7915454Z 2025-05-07T20:26:35.7915459Z 2025-05-07T20:26:35.7915464Z 2025-05-07T20:26:35.7915469Z 2025-05-07T20:26:35.7915474Z 2025-05-07T20:26:35.7915480Z 2025-05-07T20:26:35.7915485Z 2025-05-07T20:26:35.7915490Z 2025-05-07T20:26:35.7915495Z 2025-05-07T20:26:35.7915500Z 2025-05-07T20:26:35.7915512Z 2025-05-07T20:26:35.7915517Z 2025-05-07T20:26:35.7915523Z 2025-05-07T20:26:35.7915528Z 2025-05-07T20:26:35.7915761Z  2025-05-07T20:26:35.7916063Z 
2025-05-07T20:26:35.7916068Z 2025-05-07T20:26:35.7916232Z  2025-05-07T20:26:35.7916385Z 2025-05-07T20:26:35.7916391Z 2025-05-07T20:26:35.7916549Z  2025-05-07T20:26:35.7916704Z 2025-05-07T20:26:35.7916709Z 2025-05-07T20:26:35.7916722Z 2025-05-07T20:26:35.7916871Z  2025-05-07T20:26:35.7917031Z 2025-05-07T20:26:35.7917037Z 2025-05-07T20:26:35.7917042Z 2025-05-07T20:26:35.7917047Z 2025-05-07T20:26:35.7917204Z  2025-05-07T20:26:35.7917380Z 2025-05-07T20:26:35.7917385Z 2025-05-07T20:26:35.7917391Z 2025-05-07T20:26:35.7917396Z 2025-05-07T20:26:35.7917401Z 2025-05-07T20:26:35.7917560Z  2025-05-07T20:26:35.7917743Z 2025-05-07T20:26:35.7917748Z 2025-05-07T20:26:35.7917753Z 2025-05-07T20:26:35.7917758Z 2025-05-07T20:26:35.7917763Z 2025-05-07T20:26:35.7917769Z 2025-05-07T20:26:35.7917930Z  2025-05-07T20:26:35.7918118Z 2025-05-07T20:26:35.7918123Z 2025-05-07T20:26:35.7918128Z 2025-05-07T20:26:35.7918133Z 2025-05-07T20:26:35.7918138Z 2025-05-07T20:26:35.7918143Z 2025-05-07T20:26:35.7918154Z 2025-05-07T20:26:35.7918321Z  2025-05-07T20:26:35.7918528Z 2025-05-07T20:26:35.7918533Z 2025-05-07T20:26:35.7918538Z 2025-05-07T20:26:35.7918544Z 2025-05-07T20:26:35.7918555Z 2025-05-07T20:26:35.7918560Z 2025-05-07T20:26:35.7918565Z 2025-05-07T20:26:35.7918570Z 2025-05-07T20:26:35.7918747Z  2025-05-07T20:26:35.7918969Z 2025-05-07T20:26:35.7918975Z 2025-05-07T20:26:35.7918980Z 2025-05-07T20:26:35.7918985Z 2025-05-07T20:26:35.7918990Z 2025-05-07T20:26:35.7918996Z 2025-05-07T20:26:35.7919001Z 2025-05-07T20:26:35.7919014Z 2025-05-07T20:26:35.7919019Z 2025-05-07T20:26:35.7919193Z  2025-05-07T20:26:35.7919425Z 2025-05-07T20:26:35.7919430Z 2025-05-07T20:26:35.7919436Z 2025-05-07T20:26:35.7919441Z 2025-05-07T20:26:35.7919446Z 2025-05-07T20:26:35.7919451Z 2025-05-07T20:26:35.7919456Z 2025-05-07T20:26:35.7919461Z 2025-05-07T20:26:35.7919466Z 2025-05-07T20:26:35.7919470Z 2025-05-07T20:26:35.7919825Z  2025-05-07T20:26:35.7920069Z 2025-05-07T20:26:35.7920074Z 2025-05-07T20:26:35.7920079Z 2025-05-07T20:26:35.7920084Z 2025-05-07T20:26:35.7920089Z 2025-05-07T20:26:35.7920331Z 2025-05-07T20:26:35.7920336Z 2025-05-07T20:26:35.7920341Z 2025-05-07T20:26:35.7920346Z 2025-05-07T20:26:35.7920352Z 2025-05-07T20:26:35.7920357Z 2025-05-07T20:26:35.7920588Z  2025-05-07T20:26:35.7920840Z 2025-05-07T20:26:35.7920845Z 2025-05-07T20:26:35.7920850Z 2025-05-07T20:26:35.7920855Z 2025-05-07T20:26:35.7920860Z 2025-05-07T20:26:35.7920865Z 2025-05-07T20:26:35.7920870Z 2025-05-07T20:26:35.7920883Z 2025-05-07T20:26:35.7920889Z 2025-05-07T20:26:35.7920894Z 2025-05-07T20:26:35.7920899Z 2025-05-07T20:26:35.7920904Z 2025-05-07T20:26:35.7921093Z  2025-05-07T20:26:35.7921376Z 2025-05-07T20:26:35.7921383Z 2025-05-07T20:26:35.7921389Z 2025-05-07T20:26:35.7921404Z 2025-05-07T20:26:35.7921409Z 2025-05-07T20:26:35.7921414Z 2025-05-07T20:26:35.7921428Z 2025-05-07T20:26:35.7921433Z 2025-05-07T20:26:35.7921438Z 2025-05-07T20:26:35.7921443Z 2025-05-07T20:26:35.7921448Z 2025-05-07T20:26:35.7921454Z 2025-05-07T20:26:35.7921467Z 2025-05-07T20:26:35.7921697Z  2025-05-07T20:26:35.7921974Z 2025-05-07T20:26:35.7921979Z 2025-05-07T20:26:35.7921984Z 2025-05-07T20:26:35.7921989Z 2025-05-07T20:26:35.7921993Z 2025-05-07T20:26:35.7921998Z 2025-05-07T20:26:35.7922003Z 2025-05-07T20:26:35.7922008Z 2025-05-07T20:26:35.7922013Z 2025-05-07T20:26:35.7922019Z 2025-05-07T20:26:35.7922023Z 2025-05-07T20:26:35.7922029Z 2025-05-07T20:26:35.7922034Z 2025-05-07T20:26:35.7922039Z 2025-05-07T20:26:35.7922245Z  2025-05-07T20:26:35.7922522Z 2025-05-07T20:26:35.7922527Z 2025-05-07T20:26:35.7922532Z 
2025-05-07T20:26:35.7922537Z 2025-05-07T20:26:35.7922543Z 2025-05-07T20:26:35.7922548Z 2025-05-07T20:26:35.7922553Z 2025-05-07T20:26:35.7922558Z 2025-05-07T20:26:35.7922568Z 2025-05-07T20:26:35.7922573Z 2025-05-07T20:26:35.7922578Z 2025-05-07T20:26:35.7922583Z 2025-05-07T20:26:35.7922597Z 2025-05-07T20:26:35.7922603Z 2025-05-07T20:26:35.7922608Z 2025-05-07T20:26:35.7922831Z  2025-05-07T20:26:35.7923118Z 2025-05-07T20:26:35.7923124Z 2025-05-07T20:26:35.7923129Z 2025-05-07T20:26:35.7923134Z 2025-05-07T20:26:35.7923139Z 2025-05-07T20:26:35.7923144Z 2025-05-07T20:26:35.7923149Z 2025-05-07T20:26:35.7923154Z 2025-05-07T20:26:35.7923159Z 2025-05-07T20:26:35.7923164Z 2025-05-07T20:26:35.7923169Z 2025-05-07T20:26:35.7923174Z 2025-05-07T20:26:35.7923179Z 2025-05-07T20:26:35.7923184Z 2025-05-07T20:26:35.7923189Z 2025-05-07T20:26:35.7923195Z 2025-05-07T20:26:35.7923414Z  2025-05-07T20:26:35.7923703Z 2025-05-07T20:26:35.7923708Z 2025-05-07T20:26:35.7923713Z 2025-05-07T20:26:35.7923718Z 2025-05-07T20:26:35.7923723Z 2025-05-07T20:26:35.7923729Z 2025-05-07T20:26:35.7923739Z 2025-05-07T20:26:35.7923744Z 2025-05-07T20:26:35.7923749Z 2025-05-07T20:26:35.7923754Z 2025-05-07T20:26:35.7923759Z 2025-05-07T20:26:35.7923764Z 2025-05-07T20:26:35.7923769Z 2025-05-07T20:26:35.7923779Z 2025-05-07T20:26:35.7923784Z 2025-05-07T20:26:35.7923796Z 2025-05-07T20:26:35.7923801Z 2025-05-07T20:26:35.7924016Z  2025-05-07T20:26:35.7924315Z 2025-05-07T20:26:35.7924321Z 2025-05-07T20:26:35.7924326Z 2025-05-07T20:26:35.7924331Z 2025-05-07T20:26:35.7924336Z 2025-05-07T20:26:35.7924349Z 2025-05-07T20:26:35.7924354Z 2025-05-07T20:26:35.7924359Z 2025-05-07T20:26:35.7924365Z 2025-05-07T20:26:35.7924370Z 2025-05-07T20:26:35.7924375Z 2025-05-07T20:26:35.7924380Z 2025-05-07T20:26:35.7924385Z 2025-05-07T20:26:35.7924390Z 2025-05-07T20:26:35.7924395Z 2025-05-07T20:26:35.7924400Z 2025-05-07T20:26:35.7924406Z 2025-05-07T20:26:35.7924411Z 2025-05-07T20:26:35.7924648Z  2025-05-07T20:26:35.7925058Z 2025-05-07T20:26:35.7925065Z 2025-05-07T20:26:35.7925211Z  2025-05-07T20:26:35.7925362Z 2025-05-07T20:26:35.7925367Z 2025-05-07T20:26:35.7925507Z  2025-05-07T20:26:35.7925742Z 2025-05-07T20:26:35.7925747Z 2025-05-07T20:26:35.7925753Z 2025-05-07T20:26:35.7925904Z  2025-05-07T20:26:35.7926060Z 2025-05-07T20:26:35.7926065Z 2025-05-07T20:26:35.7926069Z 2025-05-07T20:26:35.7926074Z 2025-05-07T20:26:35.7926239Z  2025-05-07T20:26:35.7926401Z 2025-05-07T20:26:35.7926407Z 2025-05-07T20:26:35.7926412Z 2025-05-07T20:26:35.7926417Z 2025-05-07T20:26:35.7926423Z 2025-05-07T20:26:35.7926585Z  2025-05-07T20:26:35.7926762Z 2025-05-07T20:26:35.7926767Z 2025-05-07T20:26:35.7926772Z 2025-05-07T20:26:35.7926777Z 2025-05-07T20:26:35.7926782Z 2025-05-07T20:26:35.7926787Z 2025-05-07T20:26:35.7926952Z  2025-05-07T20:26:35.7927130Z 2025-05-07T20:26:35.7927135Z 2025-05-07T20:26:35.7927140Z 2025-05-07T20:26:35.7927145Z 2025-05-07T20:26:35.7927157Z 2025-05-07T20:26:35.7927163Z 2025-05-07T20:26:35.7927168Z 2025-05-07T20:26:35.7927336Z  2025-05-07T20:26:35.7927530Z 2025-05-07T20:26:35.7927535Z 2025-05-07T20:26:35.7927547Z 2025-05-07T20:26:35.7927552Z 2025-05-07T20:26:35.7927557Z 2025-05-07T20:26:35.7927563Z 2025-05-07T20:26:35.7927568Z 2025-05-07T20:26:35.7927573Z 2025-05-07T20:26:35.7927746Z  2025-05-07T20:26:35.7927955Z 2025-05-07T20:26:35.7927960Z 2025-05-07T20:26:35.7927965Z 2025-05-07T20:26:35.7927971Z 2025-05-07T20:26:35.7927975Z 2025-05-07T20:26:35.7927981Z 2025-05-07T20:26:35.7927986Z 2025-05-07T20:26:35.7927991Z 2025-05-07T20:26:35.7927996Z 2025-05-07T20:26:35.7928190Z  
2025-05-07T20:26:35.7928415Z 2025-05-07T20:26:35.7928420Z 2025-05-07T20:26:35.7928425Z 2025-05-07T20:26:35.7928430Z 2025-05-07T20:26:35.7928436Z 2025-05-07T20:26:35.7928441Z 2025-05-07T20:26:35.7928446Z 2025-05-07T20:26:35.7928451Z 2025-05-07T20:26:35.7928457Z 2025-05-07T20:26:35.7928477Z 2025-05-07T20:26:35.7928659Z  2025-05-07T20:26:35.7928892Z 2025-05-07T20:26:35.7928898Z 2025-05-07T20:26:35.7928903Z 2025-05-07T20:26:35.7928913Z 2025-05-07T20:26:35.7928918Z 2025-05-07T20:26:35.7928924Z 2025-05-07T20:26:35.7928936Z 2025-05-07T20:26:35.7928942Z 2025-05-07T20:26:35.7928947Z 2025-05-07T20:26:35.7928952Z 2025-05-07T20:26:35.7928957Z 2025-05-07T20:26:35.7929139Z  2025-05-07T20:26:35.7929385Z 2025-05-07T20:26:35.7929390Z 2025-05-07T20:26:35.7929396Z 2025-05-07T20:26:35.7929409Z 2025-05-07T20:26:35.7929414Z 2025-05-07T20:26:35.7929419Z 2025-05-07T20:26:35.7929424Z 2025-05-07T20:26:35.7929429Z 2025-05-07T20:26:35.7929434Z 2025-05-07T20:26:35.7929440Z 2025-05-07T20:26:35.7929445Z 2025-05-07T20:26:35.7929450Z 2025-05-07T20:26:35.7929633Z  2025-05-07T20:26:35.7929900Z 2025-05-07T20:26:35.7929905Z 2025-05-07T20:26:35.7929911Z 2025-05-07T20:26:35.7929915Z 2025-05-07T20:26:35.7929926Z 2025-05-07T20:26:35.7929930Z 2025-05-07T20:26:35.7929936Z 2025-05-07T20:26:35.7929941Z 2025-05-07T20:26:35.7929946Z 2025-05-07T20:26:35.7929951Z 2025-05-07T20:26:35.7929963Z 2025-05-07T20:26:35.7929968Z 2025-05-07T20:26:35.7929973Z 2025-05-07T20:26:35.7930162Z  2025-05-07T20:26:35.7930438Z 2025-05-07T20:26:35.7930444Z 2025-05-07T20:26:35.7930449Z 2025-05-07T20:26:35.7930454Z 2025-05-07T20:26:35.7930459Z 2025-05-07T20:26:35.7930464Z 2025-05-07T20:26:35.7930469Z 2025-05-07T20:26:35.7930474Z 2025-05-07T20:26:35.7930480Z 2025-05-07T20:26:35.7930485Z 2025-05-07T20:26:35.7930490Z 2025-05-07T20:26:35.7930496Z 2025-05-07T20:26:35.7930529Z 2025-05-07T20:26:35.7930534Z 2025-05-07T20:26:35.7930732Z  2025-05-07T20:26:35.7931013Z 2025-05-07T20:26:35.7931018Z 2025-05-07T20:26:35.7931023Z 2025-05-07T20:26:35.7931035Z 2025-05-07T20:26:35.7931041Z 2025-05-07T20:26:35.7931045Z 2025-05-07T20:26:35.7931156Z 2025-05-07T20:26:35.7931164Z 2025-05-07T20:26:35.7931169Z 2025-05-07T20:26:35.7931174Z 2025-05-07T20:26:35.7931180Z 2025-05-07T20:26:35.7931185Z 2025-05-07T20:26:35.7931273Z 2025-05-07T20:26:35.7931278Z 2025-05-07T20:26:35.7931283Z 2025-05-07T20:26:35.7931506Z  2025-05-07T20:26:35.7931797Z 2025-05-07T20:26:35.7931802Z 2025-05-07T20:26:35.7931808Z 2025-05-07T20:26:35.7931813Z 2025-05-07T20:26:35.7931818Z 2025-05-07T20:26:35.7931823Z 2025-05-07T20:26:35.7931828Z 2025-05-07T20:26:35.7931833Z 2025-05-07T20:26:35.7931838Z 2025-05-07T20:26:35.7931843Z 2025-05-07T20:26:35.7931849Z 2025-05-07T20:26:35.7931854Z 2025-05-07T20:26:35.7931859Z 2025-05-07T20:26:35.7931864Z 2025-05-07T20:26:35.7931869Z 2025-05-07T20:26:35.7931874Z 2025-05-07T20:26:35.7932089Z  2025-05-07T20:26:35.7932381Z 2025-05-07T20:26:35.7932386Z 2025-05-07T20:26:35.7932391Z 2025-05-07T20:26:35.7932396Z 2025-05-07T20:26:35.7932409Z 2025-05-07T20:26:35.7932414Z 2025-05-07T20:26:35.7932419Z 2025-05-07T20:26:35.7932424Z 2025-05-07T20:26:35.7932429Z 2025-05-07T20:26:35.7932435Z 2025-05-07T20:26:35.7932454Z 2025-05-07T20:26:35.7932459Z 2025-05-07T20:26:35.7932465Z 2025-05-07T20:26:35.7932470Z 2025-05-07T20:26:35.7932475Z 2025-05-07T20:26:35.7932480Z 2025-05-07T20:26:35.7932485Z 2025-05-07T20:26:35.7932702Z  2025-05-07T20:26:35.7933008Z 2025-05-07T20:26:35.7933013Z 2025-05-07T20:26:35.7933018Z 2025-05-07T20:26:35.7933023Z 2025-05-07T20:26:35.7933028Z 2025-05-07T20:26:35.7933033Z 
2025-05-07T20:26:35.7933039Z 2025-05-07T20:26:35.7933044Z 2025-05-07T20:26:35.7933049Z 2025-05-07T20:26:35.7933054Z 2025-05-07T20:26:35.7933059Z 2025-05-07T20:26:35.7933064Z 2025-05-07T20:26:35.7933069Z 2025-05-07T20:26:35.7933074Z 2025-05-07T20:26:35.7933079Z 2025-05-07T20:26:35.7933084Z 2025-05-07T20:26:35.7933090Z 2025-05-07T20:26:35.7933094Z 2025-05-07T20:26:35.7933333Z  2025-05-07T20:26:35.7933631Z 2025-05-07T20:26:35.7933637Z 2025-05-07T20:26:35.7933774Z  2025-05-07T20:26:35.7933928Z 2025-05-07T20:26:35.7933940Z 2025-05-07T20:26:35.7934068Z  2025-05-07T20:26:35.7934178Z 2025-05-07T20:26:35.7934188Z 2025-05-07T20:26:35.7934192Z 2025-05-07T20:26:35.7934296Z  2025-05-07T20:26:35.7934405Z 2025-05-07T20:26:35.7934409Z 2025-05-07T20:26:35.7934412Z 2025-05-07T20:26:35.7934416Z 2025-05-07T20:26:35.7934546Z  2025-05-07T20:26:35.7934664Z 2025-05-07T20:26:35.7934668Z 2025-05-07T20:26:35.7934672Z 2025-05-07T20:26:35.7934680Z 2025-05-07T20:26:35.7934684Z 2025-05-07T20:26:35.7934794Z  2025-05-07T20:26:35.7934917Z 2025-05-07T20:26:35.7934921Z 2025-05-07T20:26:35.7934924Z 2025-05-07T20:26:35.7934928Z 2025-05-07T20:26:35.7934932Z 2025-05-07T20:26:35.7934941Z 2025-05-07T20:26:35.7935052Z  2025-05-07T20:26:35.7935191Z 2025-05-07T20:26:35.7935195Z 2025-05-07T20:26:35.7935203Z 2025-05-07T20:26:35.7935206Z 2025-05-07T20:26:35.7935210Z 2025-05-07T20:26:35.7935214Z 2025-05-07T20:26:35.7935217Z 2025-05-07T20:26:35.7935338Z  2025-05-07T20:26:35.7935483Z 2025-05-07T20:26:35.7935487Z 2025-05-07T20:26:35.7935490Z 2025-05-07T20:26:35.7935494Z 2025-05-07T20:26:35.7935498Z 2025-05-07T20:26:35.7935501Z 2025-05-07T20:26:35.7935505Z 2025-05-07T20:26:35.7935509Z 2025-05-07T20:26:35.7935634Z  2025-05-07T20:26:35.7935785Z 2025-05-07T20:26:35.7935788Z 2025-05-07T20:26:35.7935792Z 2025-05-07T20:26:35.7935795Z 2025-05-07T20:26:35.7935799Z 2025-05-07T20:26:35.7935803Z 2025-05-07T20:26:35.7935806Z 2025-05-07T20:26:35.7935816Z 2025-05-07T20:26:35.7935819Z 2025-05-07T20:26:35.7935945Z  2025-05-07T20:26:35.7936101Z 2025-05-07T20:26:35.7936105Z 2025-05-07T20:26:35.7936108Z 2025-05-07T20:26:35.7936112Z 2025-05-07T20:26:35.7936116Z 2025-05-07T20:26:35.7936119Z 2025-05-07T20:26:35.7936128Z 2025-05-07T20:26:35.7936230Z 2025-05-07T20:26:35.7936235Z 2025-05-07T20:26:35.7936239Z 2025-05-07T20:26:35.7936369Z  2025-05-07T20:26:35.7936533Z 2025-05-07T20:26:35.7936613Z 2025-05-07T20:26:35.7936616Z 2025-05-07T20:26:35.7936626Z 2025-05-07T20:26:35.7936630Z 2025-05-07T20:26:35.7936633Z 2025-05-07T20:26:35.7936637Z 2025-05-07T20:26:35.7936641Z 2025-05-07T20:26:35.7936644Z 2025-05-07T20:26:35.7936648Z 2025-05-07T20:26:35.7936651Z 2025-05-07T20:26:35.7936803Z  2025-05-07T20:26:35.7937075Z 2025-05-07T20:26:35.7937080Z 2025-05-07T20:26:35.7937085Z 2025-05-07T20:26:35.7937091Z 2025-05-07T20:26:35.7937096Z 2025-05-07T20:26:35.7937101Z 2025-05-07T20:26:35.7937107Z 2025-05-07T20:26:35.7937112Z 2025-05-07T20:26:35.7937117Z 2025-05-07T20:26:35.7937122Z 2025-05-07T20:26:35.7937128Z 2025-05-07T20:26:35.7937133Z 2025-05-07T20:26:35.7937336Z  2025-05-07T20:26:35.7937606Z 2025-05-07T20:26:35.7937619Z 2025-05-07T20:26:35.7937624Z 2025-05-07T20:26:35.7937629Z 2025-05-07T20:26:35.7937635Z 2025-05-07T20:26:35.7937639Z 2025-05-07T20:26:35.7937645Z 2025-05-07T20:26:35.7937650Z 2025-05-07T20:26:35.7937664Z 2025-05-07T20:26:35.7937669Z 2025-05-07T20:26:35.7937674Z 2025-05-07T20:26:35.7937679Z 2025-05-07T20:26:35.7937684Z 2025-05-07T20:26:35.7937904Z  done 2025-05-07T20:26:36.1011745Z Preparing transaction: \ | / done 2025-05-07T20:26:42.6769472Z Verifying 
transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:43.5950038Z Executing transaction: / - \ | / - \ | / done 2025-05-07T20:26:46.1935412Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ... 2025-05-07T20:26:46.1936018Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:46.1936751Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:46.1937344Z 2025-05-07T20:26:46.1949341Z 2025-05-07T20:26:46.1950550Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:46.1951533Z 2025-05-07T20:26:46.1963712Z 2025-05-07T20:26:46.1964160Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:46.1969172Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:46.1975969Z 2025-05-07T20:26:46.3710050Z 2025-05-07T20:26:46.3716293Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:46.3720553Z 2025-05-07T20:26:46.3738427Z 2025-05-07T20:26:46.3738900Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:46.4113972Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:48.2872777Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. 
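The [INSTALL] fix-up block above works around a CUDA 12.x packaging change: NVTX v3 is header-only and the legacy libnvToolsExt.so library no longer ships, while older build scripts still link the unversioned name. A minimal sketch of the same workaround, assuming an activated conda env (so CONDA_PREFIX is set) laid out like the one in this log; the nsight-compute-* glob is an assumption standing in for the exact versioned directory:

# Recreate the unversioned NVTX link name from the versioned library that the
# conda CUDA 12.x packages still provide (paths mirror the log above).
for libdir in "${CONDA_PREFIX}/lib" "${CONDA_PREFIX}/targets/x86_64-linux/lib"; do
  if [ -f "${libdir}/libnvToolsExt.so.1" ]; then
    ln -sf "${libdir}/libnvToolsExt.so.1" "${libdir}/libnvToolsExt.so"
  fi
done
# Make the nvtx3 headers visible on the default include path; the conda
# packages currently place them under the nsight-compute tree.
cp -r "${CONDA_PREFIX}"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3/* \
      "${CONDA_PREFIX}/include/"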
2025-05-07T20:26:48.3499136Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:48.7731228Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:48.8085056Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:49.2377692Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:49.2378643Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:51.6828623Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:53.7172346Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:55.7367311Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:55.7368387Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:57.7594543Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:59.6557892Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:59.7173281Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:03.5709875Z /tmp/tmpmahzl8qv: line 3: clang: command not found
2025-05-07T20:27:03.5710928Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:03.6338520Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:03.6358470Z total 36
2025-05-07T20:27:03.6359117Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:27:03.6359721Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:25 ..
2025-05-07T20:27:03.6360294Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:27:03.6361002Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:27:03.6361717Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:27:03.6362392Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:03.6362913Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:03.6363392Z -rw-r--r--. 2 ec2-user ec2-user  2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:27:03.6364257Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
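Note that the earlier `conda run printenv LD_LIBRARY_PATH` failed simply because the variable had not been set yet; `conda env config vars set` then stores it in the env metadata so every subsequent activation (including `conda run`) exports it. A short sketch of that set-then-verify pattern, assuming the env name used in this log (build_binary); the probe commands are illustrative stand-ins for the [CHECK] lines above:

# Persist a variable in the env itself, then confirm a fresh activation sees it.
conda env config vars set -n build_binary \
    NVML_LIB_PATH="/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so"
conda run -n build_binary printenv NVML_LIB_PATH
# Sanity-probe the toolchain the same way the [CHECK] lines do; the stubs
# directory supplies link-time libcuda.so/libnvidia-ml.so without a driver.
conda run -n build_binary bash -c '
  test -f "${CONDA_PREFIX}/targets/x86_64-linux/include/cuda_runtime.h" && echo "cuda_runtime.h ok"
  test -f "${CONDA_PREFIX}/targets/x86_64-linux/lib/stubs/libcuda.so"   && echo "libcuda.so stub ok"
  command -v nvcc
'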
2025-05-07T20:27:03.6364952Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:27:03.6384043Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:27:05.5992984Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:27:05.5993556Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:27:06.0411177Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:27:07.9413099Z -allow-unsupported-compiler
2025-05-07T20:27:08.0055713Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:27:08.0056275Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:27:09.9635111Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:27:09.9635776Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:27:09.9636135Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:27:09.9636469Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:27:09.9636818Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:27:09.9637100Z #define _STL_PAIR_H 1
2025-05-07T20:27:09.9637360Z #define __cpp_attributes 200809L
2025-05-07T20:27:09.9637734Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:27:09.9638240Z #define __DELETE_THROW throw()
2025-05-07T20:27:09.9638609Z #define _PTRDIFF_T_
2025-05-07T20:27:09.9638957Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:27:09.9639339Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:27:09.9639624Z #define _IO_LEFT 02
2025-05-07T20:27:09.9639876Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:27:09.9640253Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:27:09.9640543Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:27:09.9640996Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:27:09.9641487Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:27:09.9641906Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:27:09.9642273Z #define _IOS_OUTPUT 2
2025-05-07T20:27:09.9642622Z #define __SM_100_RT_HPP__
2025-05-07T20:27:09.9643074Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:27:09.9643602Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:27:09.9644054Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:27:09.9644433Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:27:09.9644834Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:27:09.9645958Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:27:09.9647095Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:27:09.9647578Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:27:09.9648031Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:27:09.9648471Z #define _T_WCHAR_
2025-05-07T20:27:09.9648781Z #define stdout stdout
2025-05-07T20:27:09.9649247Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:27:09.9649796Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:27:09.9661112Z #define __flexarr [] 2025-05-07T20:27:09.9661517Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:09.9662005Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:09.9662509Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:09.9662867Z #define _MATH_H 1 2025-05-07T20:27:09.9663262Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:09.9663755Z #define __S64_TYPE long int 2025-05-07T20:27:09.9664111Z #define __stub_fchflags 2025-05-07T20:27:09.9664806Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:09.9665131Z #define __SQUAD_TYPE long int 2025-05-07T20:27:09.9665408Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:09.9665866Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:27:09.9666224Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:09.9666499Z #define NL_NMAX INT_MAX 2025-05-07T20:27:09.9666735Z #define _BITS_TIME_H 1 2025-05-07T20:27:09.9667022Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:09.9667369Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:09.9667680Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:09.9668044Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:09.9668457Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:09.9668839Z #define __CHAR_BIT__ 8 2025-05-07T20:27:09.9669104Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:09.9669433Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:09.9669747Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:09.9670018Z #define FP_NAN 0 2025-05-07T20:27:09.9670288Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:09.9670729Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:09.9671125Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:09.9671451Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:09.9671746Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:09.9672002Z #define __SM_80_RT_H__ 2025-05-07T20:27:09.9672236Z #define _NEW 2025-05-07T20:27:09.9672472Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:09.9672753Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:09.9673138Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:09.9673558Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:09.9673808Z #define __USE_ANSI 1 2025-05-07T20:27:09.9674102Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:09.9674521Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:09.9674897Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:09.9675206Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:09.9675501Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:09.9675803Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:09.9676089Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:09.9676386Z #define PIPE_BUF 4096 2025-05-07T20:27:09.9676725Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:09.9677194Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:27:09.9677582Z #define ADJ_TICK 0x4000 2025-05-07T20:27:09.9677873Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:09.9678210Z #define MQ_PRIO_MAX 32768 2025-05-07T20:27:09.9678480Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:09.9678813Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:09.9679296Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) 
<< 16) + (min)) 2025-05-07T20:27:09.9679841Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:09.9680322Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:09.9680587Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:09.9680873Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:09.9681168Z #define __cpp_static_assert 201411L 2025-05-07T20:27:09.9681472Z #define __GLIBCXX__ 20230528 2025-05-07T20:27:09.9681779Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:09.9682073Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:09.9682361Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:09.9682674Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:09.9682957Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:09.9683266Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.9683637Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:09.9683986Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:09.9684276Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:09.9684698Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:09.9685067Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:09.9685432Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:09.9685814Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:09.9686114Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:09.9686451Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:09.9686786Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:09.9687204Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:09.9687624Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:09.9687936Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:09.9688212Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:09.9688497Z #define __GCC_IEC_559 2 2025-05-07T20:27:09.9688801Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:09.9689149Z #define _IO_flockfile(_fp) 2025-05-07T20:27:09.9689420Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:09.9689700Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:09.9689974Z #define _IOFBF 0 2025-05-07T20:27:09.9690186Z #define __USE_BSD 1 2025-05-07T20:27:09.9690427Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:09.9690704Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:09.9690986Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:09.9691244Z #define _IO_NO_WRITES 8 2025-05-07T20:27:09.9691507Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:09.9691876Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:09.9692234Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:09.9692550Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:09.9692881Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:09.9693176Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:09.9693454Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:09.9693732Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:09.9694046Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:09.9694451Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:09.9694830Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:09.9695151Z #define M_PI 3.14159265358979323846 2025-05-07T20:27:09.9695464Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:09.9695804Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:09.9696128Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:09.9696433Z 
#define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:09.9696719Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:09.9696998Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:09.9697592Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:09.9698196Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:09.9699275Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:27:09.9699985Z 2025-05-07T20:27:09.9700124Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:09.9700462Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:09.9700772Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:09.9701063Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:09.9701358Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:09.9701700Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:27:09.9702038Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:09.9702346Z #define RAND_MAX 2147483647 2025-05-07T20:27:09.9702613Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:09.9702951Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.9703278Z #define __SM_90_RT_H__ 2025-05-07T20:27:09.9703524Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:09.9703791Z #define __COMPAR_FN_T 2025-05-07T20:27:09.9704036Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:09.9704388Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:09.9704881Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:09.9705480Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:09.9705828Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:09.9706198Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:09.9706505Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:09.9706854Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:09.9707174Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:09.9707696Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:09.9708254Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:09.9708588Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:09.9708880Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:09.9709193Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:09.9709505Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:09.9709786Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:09.9710064Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:09.9710329Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:09.9710592Z #define __u_char_defined 2025-05-07T20:27:09.9710918Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:09.9711289Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:09.9711548Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:09.9711814Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:09.9712104Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:09.9712553Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:09.9712990Z #define FP_INFINITE 1 2025-05-07T20:27:09.9713702Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:09.9714221Z #define _IO_pid_t __pid_t 2025-05-07T20:27:09.9714487Z #define __UINT_FAST8_MAX__ 0xff 
[... full predefined-macro dump from the nvcc/gcc preprocessor omitted (several thousand #define lines); the build-identifying entries are retained below ...]
2025-05-07T20:27:09.9858529Z #define __CUDACC_VER_MINOR__ 8
2025-05-07T20:27:09.9878409Z #define __GNUC__ 11
2025-05-07T20:27:09.9920065Z #define __cplusplus 201703L
2025-05-07T20:27:09.9993400Z #define __CUDACC_VER_BUILD__ 61
2025-05-07T20:27:10.0009731Z #define __VERSION__ "11.4.0"
2025-05-07T20:27:10.0047166Z #define __CUDACC_VER_MAJOR__ 12
2025-05-07T20:27:10.0047957Z #define __CUDA_ARCH__ 520
2025-05-07T20:27:10.0049137Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0049238Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:10.0049341Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:10.0049440Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:10.0049718Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:10.0049879Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:10.0049980Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:10.0050388Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:10.0050518Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:10.0050618Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:10.0050705Z #define __FLT_RADIX__ 2 2025-05-07T20:27:10.0050807Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:10.0050976Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:10.0051073Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:10.0051167Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:10.0051275Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:10.0051372Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:10.0051470Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:10.0051584Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:10.0051668Z #define WORD_BIT 32 2025-05-07T20:27:10.0051757Z #define _IO_USER_BUF 1 2025-05-07T20:27:10.0051849Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:10.0051957Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0052069Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:10.0052168Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:10.0052269Z #define __long_double_t long double 2025-05-07T20:27:10.0052369Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:10.0052461Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:10.0052868Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:10.0052953Z #define __k8 1 2025-05-07T20:27:10.0053151Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:10.0053322Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:10.0053441Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:10.0053546Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:10.0053649Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:10.0053749Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:10.0053848Z #define __blksize_t_defined 2025-05-07T20:27:10.0053944Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:10.0054043Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:10.0054156Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:10.0054255Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:10.0054361Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:10.0054457Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:10.0054557Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:10.0054816Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:10.0055167Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:10.0055269Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:10.0055371Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:10.0055458Z #define SEEK_SET 0 2025-05-07T20:27:10.0055556Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:10.0055652Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:27:10.0055858Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:10.0055961Z #define __cudaCDP2GetLastError 2025-05-07T20:27:10.0056056Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:10.0056151Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:10.0056476Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:10.0056583Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:10.0056682Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:10.0056771Z #define __stub_sigreturn 2025-05-07T20:27:10.0057018Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:10.0057115Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:10.0057294Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:10.0057401Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:10.0057487Z #define CLOCK_TAI 11 2025-05-07T20:27:10.0057594Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:10.0057885Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:10.0057974Z #define __restrict_arr 2025-05-07T20:27:10.0058090Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:10.0058233Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:10.0058769Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:10.0058960Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:10.0059051Z #define __USE_MISC 1 2025-05-07T20:27:10.0059154Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:10.0059264Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:10.0059351Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:10.0059441Z #define __LDBL_DIG__ 18 2025-05-07T20:27:10.0059536Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:10.0059644Z #define __malloc_and_calloc_defined 2025-05-07T20:27:10.0059741Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:10.0059845Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:10.0059928Z #define __x86_64__ 1 2025-05-07T20:27:10.0060016Z #define _SIZE_T_ 2025-05-07T20:27:10.0060920Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:10.0061026Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:10.0061125Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:10.0061242Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:10.0061361Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:10.0061456Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:10.0061570Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:10.0061694Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:10.0061836Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:10.0061937Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:10.0062412Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:27:10.0062538Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:10.0062690Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:10.0062790Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:10.0062884Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:10.0062976Z #define STA_FLL 0x0008 2025-05-07T20:27:10.0063126Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:10.0063223Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:10.0063348Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0063464Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:10.0063550Z #define __stub_revoke 2025-05-07T20:27:10.0063645Z #define __timer_t_defined 1 2025-05-07T20:27:10.0063779Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:10.0063874Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:10.0063981Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:10.0064086Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:10.0064185Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:10.0064287Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:10.0064397Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:10.0064498Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:10.0064644Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:10.0064844Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:10.0064942Z #define _IO_off_t __off_t 2025-05-07T20:27:10.0065033Z #define __FLT64_DIG__ 15 2025-05-07T20:27:10.0065261Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:10.0065434Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:10.0065563Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0065689Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:10.0065784Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:10.0065886Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:10.0065976Z #define NULL __null 2025-05-07T20:27:10.0066108Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:10.0066212Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:10.0066317Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:10.0066412Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0066504Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:10.0066593Z #define FP_ZERO 2 2025-05-07T20:27:10.0066698Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:10.0066856Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:10.0066965Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0067053Z #define __WCHAR_T__ 2025-05-07T20:27:10.0067150Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:10.0067348Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:10.0067502Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:10.0067605Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:10.0067727Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:10.0067843Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:10.0067976Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:10.0068103Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:10.0068200Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:10.0068291Z #define _SIGSET_H_types 1 2025-05-07T20:27:10.0068413Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:10.0068521Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:10.0068671Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:10.0068781Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:10.0068904Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:10.0069035Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:10.0069143Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:10.0069274Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:10.0069389Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:27:10.0069568Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:10.0069662Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:10.0069765Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:10.0069869Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:10.0069961Z #define STA_MODE 0x4000 2025-05-07T20:27:10.0070075Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:10.0070182Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:10.0070297Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:10.0070398Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:10.0070504Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:10.0070609Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:10.0070703Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:10.0070819Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:10.0070908Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:10.0071033Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:10.0071113Z #define __SEG_FS 1 2025-05-07T20:27:10.0071203Z #define _IO_size_t size_t 2025-05-07T20:27:10.0071306Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:10.0071402Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:10.0071486Z #define __stub_lchmod 2025-05-07T20:27:10.0071640Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:10.0071809Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0072024Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:10.0072164Z #define __SEG_GS 1 2025-05-07T20:27:10.0080096Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:10.0080418Z #define _IOS_APPEND 8 2025-05-07T20:27:10.0080521Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:10.0080618Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:10.0080726Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:10.0080829Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:10.0080943Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:10.0081031Z #define htole16(x) (x) 2025-05-07T20:27:10.0081146Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:10.0081250Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:10.0081360Z #define __INT16_TYPE__ short int 2025-05-07T20:27:10.0081476Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:10.0081612Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:10.0081725Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:10.0081854Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:10.0081947Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:10.0082036Z #define __WCLONE 0x80000000 2025-05-07T20:27:10.0082136Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:10.0082222Z #define SEEK_HOLE 4 2025-05-07T20:27:10.0082309Z #define TIMER_ABSTIME 1 2025-05-07T20:27:10.0082406Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:10.0082496Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:10.0082674Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:10.0082796Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0082894Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:27:10.0083005Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:10.0083106Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0083229Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:10.0083322Z #define _LINUX_LIMITS_H 2025-05-07T20:27:10.0083404Z #define linux 1 2025-05-07T20:27:10.0083504Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:10.0083618Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:10.0083718Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:10.0083826Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:10.0083932Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:10.0084083Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:10.0084186Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:10.0084282Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0084380Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:10.0084475Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:10.0084561Z #define htole64(x) (x) 2025-05-07T20:27:10.0084661Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:10.0084789Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:10.0084884Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:10.0085389Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:10.0085481Z #define __USE_POSIX2 1 2025-05-07T20:27:10.0085581Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:10.0085671Z #define __WALL 0x40000000 2025-05-07T20:27:10.0085775Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:10.0085859Z #define _XLOCALE_H 1 2025-05-07T20:27:10.0085959Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:10.0086058Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:10.0086154Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:10.0086263Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:10.0086349Z #define __EXCEPTIONS 1 2025-05-07T20:27:10.0086455Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:10.0086653Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:10.0086741Z #define __WORDSIZE 64 2025-05-07T20:27:10.0086835Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:10.0086923Z #define _STL_RELOPS_H 1 2025-05-07T20:27:10.0087021Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:10.0087286Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:10.0087387Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:10.0087479Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:10.0087579Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:10.0087961Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:10.0088198Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:10.0088323Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:10.0088426Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:10.0088530Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:10.0088642Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:10.0088747Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:10.0088855Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:10.0089038Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:10.0089142Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:10.0089239Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:10.0089344Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:10.0089524Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:10.0089645Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:10.0089729Z #define _STRING_H 1 2025-05-07T20:27:10.0089830Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:10.0089918Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:10.0090023Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:10.0090158Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:10.0090254Z #define __code_model_small__ 1 2025-05-07T20:27:10.0090346Z #define _PSTL_CONFIG_H 2025-05-07T20:27:10.0090447Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:10.0090563Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:10.0090662Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:10.0090765Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:10.0091118Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:10.0091211Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:10.0091300Z #define le64toh(x) (x) 2025-05-07T20:27:10.0091398Z #define FILENAME_MAX 4096 2025-05-07T20:27:10.0091552Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:10.0091665Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:10.0091754Z #define L_cuserid 9 2025-05-07T20:27:10.0091843Z #define __ino_t_defined 2025-05-07T20:27:10.0091924Z #define __k8__ 1 2025-05-07T20:27:10.0092027Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:10.0092136Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:10.0092225Z #define __int8_t_defined 2025-05-07T20:27:10.0092318Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:10.0092419Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:10.0092537Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:10.0092637Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:10.0092760Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:10.0092918Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:10.0093003Z #define __HAVE_COLUMN 2025-05-07T20:27:10.0093096Z #define __stub_fdetach 2025-05-07T20:27:10.0093522Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:10.0093604Z #define __pic__ 2 2025-05-07T20:27:10.0093723Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0093824Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:10.0093917Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:10.0094024Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:10.0094112Z #define __stub_chflags 2025-05-07T20:27:10.0094201Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:10.0094288Z #define __need_IOV_MAX 2025-05-07T20:27:10.0094397Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:10.0094500Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:10.0094680Z #define __cpp_decltype 200707L 2025-05-07T20:27:10.0094781Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:10.0094873Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:10.0095058Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:10.0095145Z #define TTY_NAME_MAX 32 2025-05-07T20:27:10.0095315Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:10.0095436Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0095606Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:10.0095718Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:10.0095810Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:10.0095901Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:10.0095986Z #define __import__ 2025-05-07T20:27:10.0096074Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:10.0096207Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:10.0096295Z #define __export__ 2025-05-07T20:27:10.0096417Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:10.0096516Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:10.0096682Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:10.0096781Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:10.0096875Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:10.0096969Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:10.0097060Z #define _WCHAR_T_DECLARED 2025-05-07T20:27:10.0097183Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:10.0097298Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:10.0097403Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:10.0097497Z #define WNOWAIT 0x01000000 2025-05-07T20:27:10.0097578Z #define PLOSS 6 2025-05-07T20:27:10.0097668Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:10.0097939Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:10.0098025Z #define EXIT_SUCCESS 0 2025-05-07T20:27:10.0098131Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:10.0098226Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:10.0098325Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:10.0098419Z #define __thread__ __thread 2025-05-07T20:27:10.0098520Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:10.0098611Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:10.0098717Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:10.0098946Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:10.0099058Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:10.0099155Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:10.0099235Z #define __linux__ 1 2025-05-07T20:27:10.0099330Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:10.0099459Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:10.0099550Z #define __S16_TYPE short int 2025-05-07T20:27:10.0099908Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:10.0100021Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:10.0100214Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:10.0100320Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:10.0100417Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:10.0100496Z #define _T_SIZE_ 2025-05-07T20:27:10.0100598Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:10.0100717Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:10.0100809Z #define _PSTL_VERSION 12000 2025-05-07T20:27:10.0100932Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:10.0101024Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:10.0101123Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:10.0101252Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:10.0101337Z #define _IOS_INPUT 1 2025-05-07T20:27:10.0101432Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:10.0101536Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:10.0101716Z #define __INT64_TYPE__ long int 2025-05-07T20:27:10.0101817Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:10.0101915Z #define __shared__ __location__(shared) 2025-05-07T20:27:10.0102007Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:10.0102240Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:10.0102327Z #define __gid_t_defined 2025-05-07T20:27:10.0102442Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:10.0102538Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:10.0102739Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:10.0102839Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:10.0102928Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:10.0103013Z #define ___int_size_t_h 2025-05-07T20:27:10.0103123Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:10.0103245Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:27:10.0103402Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:10.0103515Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:10.0103610Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:10.0103707Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:10.0103805Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:10.0103934Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0104052Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:10.0104172Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:10.0104263Z #define __clock_t_defined 1 2025-05-07T20:27:10.0104365Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:10.0104475Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:10.0104563Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:10.0104657Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:10.0104753Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:10.0104861Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:10.0104958Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:10.0105134Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:10.0105220Z #define __SSE__ 1 2025-05-07T20:27:10.0105314Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:10.0105410Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:10.0105500Z #define _CTYPE_H 1 2025-05-07T20:27:10.0105589Z #define __sigset_t_defined 2025-05-07T20:27:10.0105694Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:10.0105837Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:10.0105958Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:10.0106079Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:10.0106176Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:10.0106259Z #define __SM_70_RT_H__ 2025-05-07T20:27:10.0106354Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:10.0106461Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:10.0106556Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:10.0106721Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:10.0106815Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:10.0106928Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:10.0107027Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:10.0107118Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:10.0107198Z #define __amd64__ 1 2025-05-07T20:27:10.0107294Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:10.0107396Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:10.0107667Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:10.0107775Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:10.0107856Z #define EOF (-1) 2025-05-07T20:27:10.0107953Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:10.0108048Z #define __USE_POSIX199309 1 2025-05-07T20:27:10.0108142Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:10.0108238Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:10.0108331Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:10.0108425Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:10.0108540Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:10.0108725Z #define ____mbstate_t_defined 1 2025-05-07T20:27:10.0108821Z #define STA_NANO 0x2000 2025-05-07T20:27:10.0108919Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:10.0109013Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:10.0109204Z #define _IO_LINKED 0x80 2025-05-07T20:27:10.0109304Z #define __cpp_lib_launder 201606 2025-05-07T20:27:10.0109393Z #define __SIZEOF_INT128__ 16 2025-05-07T20:27:10.0109492Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:10.0109589Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:10.0109681Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:10.0109828Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:10.0109936Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0110038Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:10.0110142Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:10.0110235Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:10.0110324Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:10.0110461Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:10.0110589Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:10.0110793Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:10.0110987Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:10.0111070Z #define __stub_stty 2025-05-07T20:27:10.0111241Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:10.0111326Z #define le16toh(x) (x) 2025-05-07T20:27:10.0111431Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:10.0111609Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:10.0111689Z #define _SIZET_ 2025-05-07T20:27:10.0111780Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:10.0111867Z #define _SVID_SOURCE 1 2025-05-07T20:27:10.0111946Z #define _LP64 1 2025-05-07T20:27:10.0112033Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:10.0112273Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:10.0112389Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:10.0112478Z #define __UINT8_C(c) c 2025-05-07T20:27:10.0112572Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:10.0112671Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:10.0112783Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:10.0112876Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:10.0112967Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:10.0113067Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:10.0113157Z #define CUDARTAPI 2025-05-07T20:27:10.0113241Z #define IOV_MAX 1024 2025-05-07T20:27:10.0113688Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:10.0113787Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:10.0113886Z #define P_tmpdir "/tmp" 2025-05-07T20:27:10.0113992Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:10.0114074Z #define __wchar_t__ 2025-05-07T20:27:10.0114180Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:10.0114262Z #define SEEK_END 2 2025-05-07T20:27:10.0114360Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:10.0114539Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:10.0114639Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:10.0114789Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:10.0114882Z #define ____FILE_defined 1 2025-05-07T20:27:10.0114997Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:10.0115095Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:10.0115184Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:10.0115280Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:10.0115530Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:10.0115664Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:10.0115747Z #define _IO_RIGHT 04 2025-05-07T20:27:10.0115843Z #define __END_NAMESPACE_STD 2025-05-07T20:27:10.0116030Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:10.0116273Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:10.0116405Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:10.0116503Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:10.0116609Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:10.0116814Z #define _STDDEF_H_ 2025-05-07T20:27:10.0117009Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:10.0117112Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0117238Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:10.0117462Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:10.0117584Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0117738Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:10.0117868Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:10.0117978Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:10.0118091Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:10.0118196Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:10.0118320Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:10.0118422Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:10.0118520Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:10.0118627Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:10.0118820Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:10.0118918Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:10.0119118Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:10.0119221Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:10.0119323Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:10.0119479Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:10.0119577Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:10.0119679Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:10.0119784Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:10.0119911Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:10.0120019Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:10.0120194Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:10.0120363Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:10.0120539Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:10.0120636Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:10.0120760Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:10.0120872Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:10.0120972Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:10.0121207Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:10.0121304Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:10.0121417Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:10.0121514Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:10.0121601Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:10.0121698Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:10.0121793Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:10.0121891Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:10.0121975Z #define __FXSR__ 1 2025-05-07T20:27:10.0122055Z #define _SIZE_T 2025-05-07T20:27:10.0122158Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:10.0122276Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:10.0122444Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:10.0122592Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:10.0122687Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:10.0122786Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:10.0122971Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:10.0123175Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:10.0123264Z #define _GXX_NULLPTR_T 2025-05-07T20:27:10.0123390Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:10.0123473Z #define FOPEN_MAX 16 2025-05-07T20:27:10.0123562Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:10.0123769Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:10.0123868Z #define __suseconds_t_defined 2025-05-07T20:27:10.0123954Z #define __off_t_defined 2025-05-07T20:27:10.0124117Z #define stderr stderr 2025-05-07T20:27:10.0124210Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:10.0124323Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:10.0124425Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:10.0124515Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:10.0124935Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:10.0125031Z #define __mode_t_defined 2025-05-07T20:27:10.0125116Z #define _GCC_SIZE_T 2025-05-07T20:27:10.0125212Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:10.0125313Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:10.0125426Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:10.0125524Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:10.0125616Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:10.0125724Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:10.0125828Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:10.0125938Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:10.0126028Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:10.0126107Z #define __size_t__ 2025-05-07T20:27:10.0126237Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:10.0126341Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:10.0126450Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:10.0126604Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:10.0126697Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:10.0126865Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:10.0126951Z #define _ENDIAN_H 1 2025-05-07T20:27:10.0127055Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:10.0127150Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:10.0127258Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:10.0127337Z #define __try try 2025-05-07T20:27:10.0127431Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:10.0127528Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:10.0127621Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:10.0127887Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:10.0127977Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:10.0128056Z #define __PIC__ 2 2025-05-07T20:27:10.0128174Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:10.0128298Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:10.0128429Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:10.0128530Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:10.0128622Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:10.0128806Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:10.0128912Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0129015Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:10.0129113Z #define _IO_uid_t __uid_t 2025-05-07T20:27:10.0129211Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:10.0129337Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:10.0129439Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:10.0129583Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:10.0129685Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:10.0129808Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:10.0129890Z #define LONG_BIT 64 2025-05-07T20:27:10.0129998Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:10.0130101Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:10.0130228Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:10.0130321Z #define __fsfilcnt_t_defined 2025-05-07T20:27:10.0130414Z #define __blkcnt_t_defined 2025-05-07T20:27:10.0130687Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:10.0130863Z #define __USE_LARGEFILE 1 2025-05-07T20:27:10.0130963Z #define __cpp_constexpr 201603L 2025-05-07T20:27:10.0131057Z #define CUDART_VERSION 12080 2025-05-07T20:27:10.0131149Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:10.0131319Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:10.0131407Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:10.0131612Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:10.0131703Z #define __lldiv_t_defined 1 2025-05-07T20:27:10.0131783Z #define __SSE2__ 1 2025-05-07T20:27:10.0131867Z #define _IOLBF 1 2025-05-07T20:27:10.0131972Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:10.0132067Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:10.0132175Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:10.0132269Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:10.0132383Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:10.0132473Z #define __INT32_TYPE__ int 2025-05-07T20:27:10.0132561Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:10.0132675Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:10.0132774Z #define __cpp_exceptions 199711L 2025-05-07T20:27:10.0132869Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:10.0132986Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:10.0133076Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:10.0133191Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:10.0133354Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:10.0133451Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:10.0133548Z #define __SWORD_TYPE long int 2025-05-07T20:27:10.0133639Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:10.0133734Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:10.0133831Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:10.0133923Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:10.0134206Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:10.0134303Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:10.0134454Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:10.0134533Z #define _T_SIZE 2025-05-07T20:27:10.0134646Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:10.0134793Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:10.0134971Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:10.0135082Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:10.0135173Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:10.0135299Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:10.0135397Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0135487Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:10.0135667Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:10.0135755Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:10.0135860Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:10.0135954Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:10.0136071Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0136158Z #define __PIE__ 2 2025-05-07T20:27:10.0136265Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:10.0136363Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:10.0136557Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:10.0136786Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:10.0136878Z #define __nlink_t_defined 2025-05-07T20:27:10.0137007Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:10.0137120Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:10.0137205Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:10.0137472Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:10.0137588Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:10.0137694Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:10.0137795Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:10.0137887Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:10.0138093Z #define __FILE_defined 1 2025-05-07T20:27:10.0138274Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:10.0138371Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:10.0138540Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:10.0138646Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:10.0138762Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:10.0138874Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:10.0138976Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:10.0139058Z #define __INT16_C(c) c 2025-05-07T20:27:10.0139157Z #define __U32_TYPE unsigned int 2025-05-07T20:27:10.0139255Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:10.0139379Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:10.0139460Z #define __STDC__ 1 2025-05-07T20:27:10.0139556Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:10.0139656Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:10.0139757Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:10.0139911Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:10.0140004Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:10.0140103Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:10.0140206Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:10.0140321Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:10.0140431Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:10.0140528Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:10.0140632Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:10.0140714Z #define stdin stdin 2025-05-07T20:27:10.0140806Z #define __ino64_t_defined 2025-05-07T20:27:10.0140891Z #define STA_CLK 0x8000 
2025-05-07T20:27:10.0140983Z #define __clockid_t_defined 1 2025-05-07T20:27:10.0141136Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:27:10.0141301Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:27:10.0141404Z #define __cudaCDP2MemsetAsync 2025-05-07T20:27:10.0141514Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:27:10.0141619Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:27:10.0141722Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:27:10.0141933Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:27:10.0142023Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:27:10.0142562Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:27:10.0142645Z #define DOMAIN 1 2025-05-07T20:27:10.0142736Z #define M_LN2 0.69314718055994530942 2025-05-07T20:27:10.0142821Z #define __NVCC__ 1 2025-05-07T20:27:10.0142922Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:27:10.0143031Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:10.0143137Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:27:10.0143241Z #define __throw_exception_again throw 2025-05-07T20:27:10.0143332Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:27:10.0143424Z #define __EXCEPTION_H 1 2025-05-07T20:27:10.0143518Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:27:10.0143628Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:27:10.0143935Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:10.0144047Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:27:10.0144147Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:27:10.0144240Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:27:10.0144340Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:27:10.0144440Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:27:10.0144580Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:27:10.0144685Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:10.0144798Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:27:10.0144890Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:27:10.0145088Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:27:10.0145185Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:27:10.0145287Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:27:10.0145500Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:27:10.0145595Z #define __useconds_t_defined 2025-05-07T20:27:10.0145693Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:27:10.0145879Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:27:10.0146030Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:27:10.0146114Z #define __SSE_MATH__ 1 2025-05-07T20:27:10.0146208Z #define _IO_wint_t wint_t 2025-05-07T20:27:10.0146302Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:27:10.0146399Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:27:10.0146493Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:27:10.0146607Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:27:10.0146707Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:27:10.0146803Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:27:10.0146886Z #define __USE_ATFILE 1 2025-05-07T20:27:10.0146981Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:27:10.0147076Z #define 
_POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:27:10.0147167Z #define _GCC_PTRDIFF_T 2025-05-07T20:27:10.0147401Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:10.0147498Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:27:10.0147595Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:27:10.0147700Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:27:10.0147807Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:27:10.0147893Z #define _STDLIB_H 1 2025-05-07T20:27:10.0148032Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:27:10.0148127Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0148225Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:27:10.0148352Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0148461Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:10.0148564Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:27:10.0148749Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:27:10.0148903Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:27:10.0149017Z #define __glibcxx_requires_nonempty() 2025-05-07T20:27:10.0149132Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:27:10.0149228Z #define __ldiv_t_defined 1 2025-05-07T20:27:10.0149411Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:27:10.0149503Z #define ___int_ptrdiff_t_h 2025-05-07T20:27:10.0149675Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:10.0149778Z #define __cudaCDP2EventDestroy 2025-05-07T20:27:10.0149869Z #define __HOST_DEFINES_H__ 2025-05-07T20:27:10.0149973Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:27:10.0150074Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:10.0150173Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:27:10.0150263Z #define CUDART_CB 2025-05-07T20:27:10.0150364Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:27:10.0150488Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:27:10.0150582Z #define MB_LEN_MAX 16 2025-05-07T20:27:10.0150807Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:10.0150912Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:27:10.0151036Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:27:10.0151148Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:27:10.0151246Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:27:10.0151420Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:27:10.0151542Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:27:10.0151636Z #define _GNU_SOURCE 1 2025-05-07T20:27:10.0151721Z #define __stub_putmsg 2025-05-07T20:27:10.0151804Z #define __CUDACC__ 1 2025-05-07T20:27:10.0151895Z #define __N(msgid) (msgid) 2025-05-07T20:27:10.0151978Z #define __P(args) args 2025-05-07T20:27:10.0152320Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:27:10.0152422Z #define __cpp_init_captures 201304L 2025-05-07T20:27:10.0152527Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:27:10.0152692Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:27:10.0152788Z #define __cpp_lib_as_const 201510 2025-05-07T20:27:10.0152870Z #define __WCHAR_T 2025-05-07T20:27:10.0152964Z #define __ATOMIC_RELEASE 3 2025-05-07T20:27:10.0153056Z #define __fsblkcnt_t_defined 2025-05-07T20:27:10.0153171Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:27:10.0153277Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 
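A dump of this shape can be reproduced by hand outside the job. The sketch below is a minimal, assumed-equivalent command, not the exact helper from .github/scripts/setup_env.bash (which is not shown in this log); the probe file path is illustrative:

  # Preprocess a trivial CUDA source (-E) and ask the host compiler to print
  # every macro in effect (-Xcompiler -dM): this yields the glibc/libstdc++
  # defines together with CUDA runtime ones such as __CUDACC_VER_MAJOR__.
  # Whether device-only macros like __CUDA_ARCH__ appear depends on which
  # nvcc preprocessing trajectory emits the dump.
  printf '#include <cuda_runtime.h>\n' > /tmp/macro_probe.cu   # hypothetical path
  conda run -n build_binary nvcc -E -Xcompiler -dM /tmp/macro_probe.cu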
2025-05-07T20:27:10.0262970Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:11.9201286Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:11.9201673Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:11.9202012Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:11.9202411Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:11.9202762Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:11.9836686Z /usr/bin/nvidia-smi
2025-05-07T20:27:11.9842503Z + nvidia-smi
2025-05-07T20:27:12.0018030Z Wed May  7 20:27:11 2025
2025-05-07T20:27:12.0018610Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.0019227Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:12.0019740Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:12.0020255Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:12.0020808Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:12.0021283Z |                                         |                        |               MIG M. |
2025-05-07T20:27:12.0021636Z |=========================================+========================+======================|
2025-05-07T20:27:12.0189562Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:12.0190047Z |  0%   30C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:27:12.0190446Z |                                         |                        |                  N/A |
2025-05-07T20:27:12.0190867Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:12.0194871Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.0195327Z | Processes:                                                                              |
2025-05-07T20:27:12.0195784Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:12.0196205Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:12.0196579Z |=========================================================================================|
2025-05-07T20:27:12.0199342Z |  No running processes found                                                             |
2025-05-07T20:27:12.0199833Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.2608687Z [INSTALL] Successfully installed CUDA 12.8.0
2025-05-07T20:27:12.2666582Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:12.2667151Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0 2025-05-07T20:27:12.2680052Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:12.2680504Z env: 2025-05-07T20:27:12.2680742Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:12.2681058Z BUILD_ENV: build_binary 2025-05-07T20:27:12.2681509Z BUILD_TARGET: genai 2025-05-07T20:27:12.2681746Z BUILD_VARIANT: cuda 2025-05-07T20:27:12.2682018Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:27:12.2682299Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:12.2682613Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:12.2682963Z ##[endgroup] 2025-05-07T20:27:12.6051728Z ################################################################################ 2025-05-07T20:27:12.6052237Z # Install PyTorch (PIP) 2025-05-07T20:27:12.6052564Z # 2025-05-07T20:27:12.6066840Z # [2025-05-07T20:27:12.606Z] + install_pytorch_pip build_binary nightly cuda/12.8.0 2025-05-07T20:27:12.6067487Z ################################################################################ 2025-05-07T20:27:12.6067772Z 2025-05-07T20:27:12.6095500Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:13.6115230Z Channels: 2025-05-07T20:27:13.6115485Z - conda-forge 2025-05-07T20:27:13.6115730Z Platform: linux-64 2025-05-07T20:27:16.9660952Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:17.6959149Z Solving environment: \ | / done 2025-05-07T20:27:17.9195914Z 2025-05-07T20:27:17.9196339Z ## Package Plan ## 2025-05-07T20:27:17.9196775Z 2025-05-07T20:27:17.9197404Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:17.9198279Z 2025-05-07T20:27:17.9198478Z added / updated specs: 2025-05-07T20:27:17.9199153Z - numpy 2025-05-07T20:27:17.9199483Z 2025-05-07T20:27:17.9199529Z 2025-05-07T20:27:17.9199877Z The following packages will be downloaded: 2025-05-07T20:27:17.9200643Z 2025-05-07T20:27:17.9200965Z package | build 2025-05-07T20:27:17.9201713Z ---------------------------|----------------- 2025-05-07T20:27:17.9202512Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:17.9203447Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:17.9203975Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:17.9204443Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:17.9204918Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:17.9205577Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:17.9206205Z numpy-2.2.5 | py313h17eae1a_0 8.1 MB conda-forge 2025-05-07T20:27:17.9206611Z ------------------------------------------------------------ 2025-05-07T20:27:17.9206970Z Total: 15.4 MB 2025-05-07T20:27:17.9207188Z 2025-05-07T20:27:17.9207320Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:17.9207554Z 2025-05-07T20:27:17.9207779Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:17.9208347Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:17.9209084Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:17.9209675Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:17.9210211Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:17.9210775Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:17.9211613Z numpy 
conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:17.9212064Z Downloading and Extracting Packages: ...working... [ ... per-package download progress bars and terminal control sequences elided; all seven packages reached 100% ... ] done
2025-05-07T20:27:18.7927220Z Preparing transaction: done
2025-05-07T20:27:18.9933289Z Verifying transaction: done
2025-05-07T20:27:19.0942250Z Executing transaction: done
2025-05-07T20:27:19.2758515Z ################################################################################
2025-05-07T20:27:19.2758892Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:19.2759203Z #
2025-05-07T20:27:19.2776173Z # [2025-05-07T20:27:19.277Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:19.2776838Z ################################################################################
2025-05-07T20:27:19.2792213Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:19.3762604Z [CHECK] Network does not appear to be blocked. 
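The step below resolves the requested (channel, CUDA version) pair into a concrete pip index URL and package spec. A minimal sketch of that mapping, using only the naming scheme visible in this log (nightly channel, 12.8.0 -> cu128), plus a follow-up check mirroring the variant and _GLIBCXX_USE_CXX11_ABI verifications further down:

    # Derive the PyTorch nightly index URL from the CUDA version (assumed scheme).
    cuda_version="12.8.0"
    variant="cu$(echo "${cuda_version}" | cut -d. -f1-2 | tr -d .)"   # 12.8.0 -> cu128
    index_url="https://download.pytorch.org/whl/nightly/${variant}/"
    conda run -n build_binary pip install --pre torch --index-url "${index_url}"
    # Confirm the installed wheel is the right variant and C++11 ABI setting.
    conda run -n build_binary python -c "import torch; print(torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi())"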
2025-05-07T20:27:19.3763113Z ################################################################################ 2025-05-07T20:27:19.3763472Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:19.3763757Z # 2025-05-07T20:27:19.3780075Z # [2025-05-07T20:27:19.377Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:27:19.3780880Z ################################################################################ 2025-05-07T20:27:19.3781114Z 2025-05-07T20:27:19.3801759Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:19.3828220Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:27:19.3844976Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:19.3845540Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:27:19.3852893Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:19.3860282Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:27:19.3881390Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:32.0685012Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:28:32.0687474Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:32.0688062Z Collecting torch 2025-05-07T20:28:32.0688853Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:32.0689586Z Collecting filelock (from torch) 2025-05-07T20:28:32.0689832Z 2025-05-07T20:28:32.0690163Z Using cached https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:32.0691133Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:28:32.0692244Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:28:32.0692930Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:32.0693448Z Using cached https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:32.0693977Z Collecting networkx (from torch) 2025-05-07T20:28:32.0694500Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:32.0695522Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.2 MB/s eta 0:00:00 2025-05-07T20:28:32.0695906Z Collecting jinja2 (from torch) 2025-05-07T20:28:32.0696407Z Using cached https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:32.0696927Z Collecting fsspec (from torch) 2025-05-07T20:28:32.0697448Z Using cached https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:32.0698051Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 
2025-05-07T20:28:32.0698903Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0699764Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:32.0700631Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0701494Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:32.0702334Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0703155Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:32.0704243Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:32.0704990Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:32.0705736Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0706477Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:32.0707470Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:32.0708296Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:32.0709030Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:32.0709784Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:32.0710557Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0711317Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:32.0712156Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0713001Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:32.0714031Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:32.0714764Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:32.0715619Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:32.0716429Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:32.0717226Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0718031Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:32.0718873Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0719711Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:32.0720714Z Using cached 
https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:32.0721559Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:32.0722430Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0723284Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:32.0723862Z Using cached https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:32.0724414Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:32.0724937Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:28:32.0725450Z Preparing metadata (setup.py): started 2025-05-07T20:28:32.0725849Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:28:32.0726632Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1047.0 MB) 2025-05-07T20:28:32.0727502Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 23.8 MB/s eta 0:00:00 2025-05-07T20:28:32.0728385Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:32.0729499Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:32.0730691Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:32.0731890Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:32.0733117Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:32.0734208Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:32.0735382Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:32.0736468Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:32.0737493Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:32.0738608Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T20:28:32.0739728Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:32.0740821Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:32.0742000Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 
MB) 2025-05-07T20:28:32.0743164Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:28:32.0744342Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:32.0745313Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 196.0 MB/s eta 0:00:00 2025-05-07T20:28:32.0745719Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:28:32.0746119Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:28:32.0746578Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:28:32.0747475Z Created wheel for MarkupSafe: filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=8642341f746950f07f790b09c3e552393bd8cdf535cdc73dd539cf084cd476d7 2025-05-07T20:28:32.0748533Z Stored in directory: /home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6 2025-05-07T20:28:32.0749138Z Successfully built MarkupSafe 2025-05-07T20:28:32.0750874Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:32.0752516Z 2025-05-07T20:28:32.0754639Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:28:32.0756758Z 2025-05-07T20:28:34.2952942Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:28:34.2957288Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:28:37.7141467Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:41.1478655Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:41.1479110Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:44.4966534Z True 2025-05-07T20:28:44.4966787Z True 2025-05-07T20:28:44.4966897Z 2025-05-07T20:28:44.5593393Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:44.5633219Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:44.5633858Z if . 
$PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:44.5647563Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:44.5647932Z env: 2025-05-07T20:28:44.5648164Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:44.5648475Z BUILD_ENV: build_binary 2025-05-07T20:28:44.5648734Z BUILD_TARGET: genai 2025-05-07T20:28:44.5649033Z BUILD_VARIANT: cuda 2025-05-07T20:28:44.5649289Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:44.5649553Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:44.5649868Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:44.5650216Z ##[endgroup] 2025-05-07T20:28:44.8981962Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:44.8983690Z ################################################################################ 2025-05-07T20:28:44.8984194Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:44.8984567Z # 2025-05-07T20:28:44.9000398Z # [2025-05-07T20:28:44.899Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:44.9000820Z ################################################################################ 2025-05-07T20:28:44.9001044Z 2025-05-07T20:28:44.9018390Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:44.9944436Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:44.9953964Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:44.9954634Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:44.9955039Z 2025-05-07T20:28:45.0821301Z 2025-05-07T20:28:45.0821733Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:45.0845255Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:50.9642005Z Collecting environment information... 
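As an aside, the same report can be generated without downloading the script, since collect_env ships inside torch itself; a one-line sketch:

    # torch bundles the environment-report tool as a runnable module.
    conda run -n build_binary python -m torch.utils.collect_env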
2025-05-07T20:28:50.9642398Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:50.9642696Z Is debug build: False 2025-05-07T20:28:50.9642957Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:50.9643249Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:50.9643432Z 2025-05-07T20:28:50.9643538Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:50.9643873Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:50.9644207Z Clang version: Could not collect 2025-05-07T20:28:50.9644506Z CMake version: Could not collect 2025-05-07T20:28:50.9644780Z Libc version: glibc-2.34 2025-05-07T20:28:50.9644947Z 2025-05-07T20:28:50.9645260Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:28:50.9645939Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:50.9646362Z Is CUDA available: True 2025-05-07T20:28:50.9646623Z CUDA runtime version: 12.8.61 2025-05-07T20:28:50.9646905Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:50.9647225Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:50.9647562Z Nvidia driver version: 570.133.07 2025-05-07T20:28:50.9647853Z cuDNN version: Could not collect 2025-05-07T20:28:50.9648135Z HIP runtime version: N/A 2025-05-07T20:28:50.9648391Z MIOpen runtime version: N/A 2025-05-07T20:28:50.9648661Z Is XNNPACK available: True 2025-05-07T20:28:50.9648834Z 2025-05-07T20:28:50.9648914Z CPU: 2025-05-07T20:28:50.9649139Z Architecture: x86_64 2025-05-07T20:28:50.9649801Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:50.9650208Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:50.9650609Z Byte Order: Little Endian 2025-05-07T20:28:50.9650930Z CPU(s): 16 2025-05-07T20:28:50.9651241Z On-line CPU(s) list: 0-15 2025-05-07T20:28:50.9651776Z Vendor ID: AuthenticAMD 2025-05-07T20:28:50.9652127Z Model name: AMD EPYC 7R32 2025-05-07T20:28:50.9652459Z CPU family: 23 2025-05-07T20:28:50.9652758Z Model: 49 2025-05-07T20:28:50.9653057Z Thread(s) per core: 2 2025-05-07T20:28:50.9653355Z Core(s) per socket: 8 2025-05-07T20:28:50.9653649Z Socket(s): 1 2025-05-07T20:28:50.9653938Z Stepping: 0 2025-05-07T20:28:50.9654257Z BogoMIPS: 5599.99 2025-05-07T20:28:50.9656389Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:50.9658499Z Hypervisor vendor: KVM 2025-05-07T20:28:50.9658828Z Virtualization type: full 2025-05-07T20:28:50.9659179Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:50.9659553Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:50.9659935Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:50.9660307Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:50.9660633Z NUMA node(s): 1 2025-05-07T20:28:50.9660936Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:50.9661284Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:50.9661666Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:50.9662037Z Vulnerability L1tf: Not affected 2025-05-07T20:28:50.9662399Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:50.9662763Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:50.9663129Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:50.9663509Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:50.9664074Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:50.9664672Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:50.9665236Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:50.9665945Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:50.9666962Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:50.9667656Z Vulnerability Srbds: Not affected 2025-05-07T20:28:50.9668028Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:50.9668266Z 2025-05-07T20:28:50.9668377Z Versions of relevant libraries: 2025-05-07T20:28:50.9668650Z [pip3] numpy==2.2.5 2025-05-07T20:28:50.9668903Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:50.9669225Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:50.9669542Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:50.9669979Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:50.9670306Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:50.9670610Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:50.9670909Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:50.9671223Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:50.9671546Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:50.9671970Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:50.9672287Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:50.9672587Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:50.9672895Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:50.9673200Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:50.9673522Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:50.9673909Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:50.9674418Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:50.9674961Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:50.9675504Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:50.9676055Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:50.9676613Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:50.9677120Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9677616Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:50.9678119Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:50.9678664Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:50.9679162Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9679658Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:50.9680256Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9680744Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9681241Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:50.9681747Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:50.9682232Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:50.9682716Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:50.9683209Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9683696Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:50.9684185Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9684676Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:50.9685178Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:50.9685688Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:50.9686198Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9686711Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:50.9687223Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9687737Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:50.9688221Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:28:50.9688708Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:50.9689323Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:50.9689845Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:50.9690361Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:50.9690972Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:50.9691569Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:50.9692063Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:50.9692571Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:50.9693082Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:50.9693603Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:50.9694099Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:50.9694707Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:50.9695208Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:50.9695704Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:50.9696186Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:50.9696472Z 2025-05-07T20:28:51.0374969Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:51.0375667Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:51.0387413Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:51.0387774Z env: 2025-05-07T20:28:51.0388011Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:51.0388323Z BUILD_ENV: build_binary 2025-05-07T20:28:51.0388583Z BUILD_TARGET: genai 2025-05-07T20:28:51.0388825Z BUILD_VARIANT: cuda 2025-05-07T20:28:51.0389092Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:51.0389358Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:51.0389672Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:51.0390019Z ##[endgroup] 2025-05-07T20:28:51.3750889Z ################################################################################ 2025-05-07T20:28:51.3751289Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:51.3766941Z # 2025-05-07T20:28:51.3767316Z # [2025-05-07T20:28:51.376Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:51.3767727Z ################################################################################ 2025-05-07T20:28:51.3767961Z 2025-05-07T20:28:51.3782844Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:51.4724544Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:51.4746589Z [BUILD] Running git submodules update ... 2025-05-07T20:28:51.4767928Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:51.5128986Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:51.5129494Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:51.5129940Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:51.5130346Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:51.5130765Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:51.5131207Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:51.5131630Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:51.5164858Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:51.5724511Z [BUILD] Installing other build dependencies ... 
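Each [EXEC] [ATTEMPT i/3] line in this log comes from a retry wrapper defined in .github/scripts/setup_env.bash; that implementation is not shown here, so the following is only a hypothetical bash reconstruction of the observed behavior:

    # Hypothetical retry wrapper behind the "[EXEC] [ATTEMPT i/3]" lines;
    # the real helper in setup_env.bash may differ (backoff interval is a guess).
    exec_with_retries () {
      local max_retries=3 attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then
          return 0
        fi
        sleep 2
      done
      return 1
    }

    exec_with_retries conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt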
2025-05-07T20:28:51.5746539Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:53.9678083Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:53.9688840Z Using cached backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:54.0096479Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:54.0105805Z Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:54.1419886Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:54.1430362Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:54.1868835Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:54.1877061Z Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:54.4161410Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:54.4171872Z Using cached hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:54.4257089Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:54.4260592Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:54.4688221Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:54.4696979Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:54.4710527Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:54.5035081Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:54.5044392Z Using cached pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:54.5908326Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:54.6080967Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:54.6939332Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:54.6948647Z Using cached scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:54.6998318Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:54.7457051Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:54.7466149Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:54.7821941Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:54.7831335Z Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:54.8173677Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:54.8183422Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:54.8892275Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.8922632Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:54.9633339Z Collecting pyproject_hooks (from build->-r 
requirements.txt (line 14)) 2025-05-07T20:28:54.9642361Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:54.9975152Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.9984023Z Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:55.0450315Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:55.0468245Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:55.0484560Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:55.1010902Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:55.1037234Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:55.1513286Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:55.1836281Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:55.1845264Z Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:55.1865133Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:55.2501281Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:55.2540860Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:55.2976531Z Using cached backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:55.2985455Z Using cached build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:55.2994654Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:55.3187265Z Using cached click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:55.3196394Z Using cached hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:55.3209027Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:55.3218232Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:55.3229471Z Using cached pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:55.3270333Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:28:55.4022013Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 7.2 MB/s eta 0:00:00 2025-05-07T20:28:55.4030664Z Using cached scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:55.4040265Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:55.4049341Z Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:55.4058820Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:55.4070257Z Using cached attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:55.4098585Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:55.4575936Z Using cached distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:55.4584816Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 
2025-05-07T20:28:55.4612317Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:55.5168113Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:55.6855347Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:58.1095109Z 2025-05-07T20:28:58.1149309Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:58.2830439Z ################################################################################ 2025-05-07T20:28:58.2830840Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:58.2831109Z # 2025-05-07T20:28:58.2848825Z # [2025-05-07T20:28:58.284Z] + install_triton_pip build_binary 2025-05-07T20:28:58.2849230Z ################################################################################ 2025-05-07T20:28:58.2849454Z 2025-05-07T20:28:58.2850053Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:58.2850511Z ################################################################################ 2025-05-07T20:28:58.2850895Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:58.2851229Z # 2025-05-07T20:28:58.2868912Z # [2025-05-07T20:28:58.286Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:58.2869458Z ################################################################################ 2025-05-07T20:28:58.2869695Z 2025-05-07T20:28:58.2886258Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:58.3840074Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:58.3840909Z ################################################################################ 2025-05-07T20:28:58.3841592Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:58.3842172Z # 2025-05-07T20:28:58.3858101Z # [2025-05-07T20:28:58.385Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:58.3858605Z ################################################################################ 2025-05-07T20:28:58.3858837Z 2025-05-07T20:28:58.3904170Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:58.3920861Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:58.3921784Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:58.3929687Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:58.3939103Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:58.3960254Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.7333667Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
2025-05-07T20:29:05.7334973Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:05.7335628Z 2025-05-07T20:29:05.7335868Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.7336299Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:05.7337134Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:05.7338396Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:05.7339499Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 63.3 MB/s eta 0:00:00 2025-05-07T20:29:05.7339910Z Installing collected packages: pytorch-triton 2025-05-07T20:29:05.7340268Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:05.7340678Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:05.7341110Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:05.7341550Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:05.7342021Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:05.7342290Z 2025-05-07T20:29:07.9482060Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:07.9486177Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:10.0996021Z ################################################################################ 2025-05-07T20:29:10.0996504Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:10.0996897Z ################################################################################ 2025-05-07T20:29:10.0997464Z 2025-05-07T20:29:12.1614151Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:14.3351615Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:14.3355306Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:14.3401039Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.3401542Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.3413631Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:14.3414026Z env: 2025-05-07T20:29:14.3414268Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:14.3414583Z BUILD_ENV: build_binary 2025-05-07T20:29:14.3414846Z BUILD_TARGET: genai 2025-05-07T20:29:14.3415088Z BUILD_VARIANT: cuda 2025-05-07T20:29:14.3415332Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:14.3415606Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:14.3415927Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:14.3416278Z ##[endgroup] 2025-05-07T20:29:14.6792281Z ################################################################################ 2025-05-07T20:29:14.6792823Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:14.6793131Z # 2025-05-07T20:29:14.6809413Z # [2025-05-07T20:29:14.680Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6810346Z ################################################################################ 2025-05-07T20:29:14.6810602Z 2025-05-07T20:29:14.6810984Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6811717Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6812073Z 2025-05-07T20:29:14.6961526Z 891428e398d8fa44bdcd60728272fd376b27a8ba fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6963838Z 2025-05-07T20:29:14.6964467Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6964999Z 2025-05-07T20:29:14.7129759Z 86a533cac2dc47ba6525697cbaf3fe89eda98f1fc3bd69dfc08261cb1f2d2035 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.7131851Z 2025-05-07T20:29:14.7132490Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.7132983Z 2025-05-07T20:29:14.7459796Z 4c3714dae593cf99d3df6aac70dd67cf fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.7461869Z 2025-05-07T20:29:14.7473751Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:14.7495072Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.5397790Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.5399007Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:17.5399880Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:17.5400432Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:17.5400708Z 2025-05-07T20:29:24.4458494Z ################################################################################ 2025-05-07T20:29:24.4458878Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:24.4459277Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:24.4459714Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:24.4460069Z [CHECK] 2025-05-07T20:29:24.4460433Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:24.4461020Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:24.4461416Z ################################################################################ 2025-05-07T20:29:24.4462062Z 2025-05-07T20:29:24.4462186Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:28.4268865Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:32.3777605Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:36.3468049Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:36.3473562Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:48.2166473Z ################################################################################ 2025-05-07T20:29:48.2166937Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:48.2167287Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:48.2167649Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:48.2167998Z ################################################################################ 2025-05-07T20:29:48.2168220Z 2025-05-07T20:29:56.1583244Z ################################################################################ 2025-05-07T20:29:56.1584155Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:56.1586185Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:56.1587813Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:56.1588358Z ################################################################################ 2025-05-07T20:29:56.1588586Z 2025-05-07T20:29:56.1588746Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:00.1317890Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:04.1126985Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:08.2003793Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:12.1791898Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:12.1795645Z [INSTALL] Check for operator registrations ... 
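The registration checks below print each operator's qualified name and confirm it resolves on the torch.ops namespace. The exact probe used by setup_env.bash is not shown in the log; a minimal sketch of an equivalent check:

    # Importing fbgemm_gpu loads the compiled libraries and registers the ops;
    # an unregistered operator raises on attribute lookup instead of printing.
    conda run -n build_binary python -c "import fbgemm_gpu, torch; print(torch.ops.fbgemm.nccl_init)"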
2025-05-07T20:30:12.1795645Z [INSTALL] Check for operator registrations ...
2025-05-07T20:30:16.0736516Z fbgemm.nccl_init
2025-05-07T20:30:16.1350976Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init
2025-05-07T20:30:20.0208421Z fbgemm.gqa_attn_splitk
2025-05-07T20:30:20.0823430Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk
2025-05-07T20:30:23.9656904Z fbgemm.rope_qkv_decoding
2025-05-07T20:30:24.0266998Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding
2025-05-07T20:30:24.0268164Z [INSTALL] FBGEMM-GPU installation through wheel completed ...
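[NOTE] The operator-registration probes above can be approximated by resolving each op on the torch.ops.fbgemm namespace, which raises when nothing has registered it. A sketch, assuming the wheel is installed so that importing fbgemm_gpu loads the native libraries:

    import torch
    import fbgemm_gpu  # noqa: F401  (importing loads the .so files that register the ops)

    for op_name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        try:
            getattr(torch.ops.fbgemm, op_name)  # resolves the op or raises
        except (AttributeError, RuntimeError) as err:
            raise SystemExit(f"operator fbgemm.{op_name} is NOT registered: {err}")
        print(f"[CHECK] operator appears to be registered: torch.ops.fbgemm.{op_name}")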
2025-05-07T20:30:24.0304461Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:30:24.0304948Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:30:24.0318069Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:30:24.0318440Z env:
2025-05-07T20:30:24.0318683Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:30:24.0319000Z BUILD_ENV: build_binary
2025-05-07T20:30:24.0319263Z BUILD_TARGET: genai
2025-05-07T20:30:24.0319506Z BUILD_VARIANT: cuda
2025-05-07T20:30:24.0319752Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:30:24.0320024Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:30:24.0320439Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:30:24.0320787Z ##[endgroup]
2025-05-07T20:30:24.3674540Z ################################################################################
2025-05-07T20:30:24.3674962Z # Test All FBGEMM-GPU Modules
2025-05-07T20:30:24.3675591Z #
2025-05-07T20:30:24.3689431Z # [2025-05-07T20:30:24.368Z] + test_all_fbgemm_gpu_modules build_binary
2025-05-07T20:30:24.3690012Z ################################################################################
2025-05-07T20:30:24.3690310Z
2025-05-07T20:30:32.2805202Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda)
2025-05-07T20:30:32.2805804Z [TEST] Will be running tests specific to this target and variant ...
2025-05-07T20:30:32.2806230Z [TEST] Determined the test directories:
2025-05-07T20:30:32.2806557Z fbgemm_gpu/experimental/gen_ai/test
2025-05-07T20:30:32.2806896Z fbgemm_gpu/experimental/example/test
2025-05-07T20:30:32.2815632Z fbgemm_gpu/experimental/gemm/test
2025-05-07T20:30:32.2815846Z
2025-05-07T20:30:32.2816664Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ...
2025-05-07T20:30:32.2822154Z [TEST] Set environment variables for CUDA testing ...
2025-05-07T20:30:32.2822623Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES
2025-05-07T20:30:32.7039667Z
2025-05-07T20:30:32.7040122Z [TEST] Installing PyTest ...
2025-05-07T20:30:32.7064548Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:33.9603361Z Channels:
2025-05-07T20:30:33.9603718Z - conda-forge
2025-05-07T20:30:33.9603966Z Platform: linux-64
2025-05-07T20:30:37.2477978Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:38.3847292Z Solving environment: done
2025-05-07T20:30:38.6153453Z
2025-05-07T20:30:38.6153894Z ## Package Plan ##
2025-05-07T20:30:38.6154121Z
2025-05-07T20:30:38.6154421Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:38.6154830Z
2025-05-07T20:30:38.6154954Z added / updated specs:
2025-05-07T20:30:38.6155288Z - expecttest
2025-05-07T20:30:38.6155578Z - pytest
2025-05-07T20:30:38.6155727Z
2025-05-07T20:30:38.6155885Z The following packages will be downloaded:
2025-05-07T20:30:38.6156118Z
2025-05-07T20:30:38.6156240Z package | build
2025-05-07T20:30:38.6156570Z ---------------------------|-----------------
2025-05-07T20:30:38.6156963Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge
2025-05-07T20:30:38.6157434Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge
2025-05-07T20:30:38.6157917Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge
2025-05-07T20:30:38.6158374Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge
2025-05-07T20:30:38.6158826Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge
2025-05-07T20:30:38.6159259Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge
2025-05-07T20:30:38.6159689Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge
2025-05-07T20:30:38.6160630Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge
2025-05-07T20:30:38.6161044Z ------------------------------------------------------------
2025-05-07T20:30:38.6161397Z Total: 428 KB
2025-05-07T20:30:38.6161619Z
2025-05-07T20:30:38.6161751Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:38.6161976Z
2025-05-07T20:30:38.6162204Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:38.6162733Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:38.6163269Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:38.6163766Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:38.6164249Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:38.6164705Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:38.6165332Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:38.6165768Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:38.6166033Z
2025-05-07T20:30:38.6166198Z Downloading and Extracting Packages: ...working... done
[... per-package terminal progress bars elided: pytest-8.3.5, packaging-25.0, colorama-0.4.6, pluggy-1.5.0, exceptiongroup-1.2.2, tomli-2.2.1, expecttest-0.3.0, and iniconfig-2.0.0 all reached 100% ...]
2025-05-07T20:30:38.9718339Z Preparing transaction: done
2025-05-07T20:30:39.0721939Z Verifying transaction: done
2025-05-07T20:30:40.9749151Z Executing transaction: done
2025-05-07T20:30:41.1011128Z [TEST] Checking imports ...
2025-05-07T20:30:45.0304518Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
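[NOTE] The [EXEC] [ATTEMPT 0/3] prefix that appears before the pip and conda commands above suggests a bounded retry wrapper around flaky network operations. A sketch of that pattern; exec_with_retries is a hypothetical stand-in for whatever setup_env.bash actually defines:

    import subprocess
    import time

    def exec_with_retries(cmd: list, max_attempts: int = 3, delay_s: float = 10.0) -> None:
        # Re-run a command up to max_attempts times, sleeping between failures.
        for attempt in range(max_attempts):
            print(f"[EXEC] [ATTEMPT {attempt}/{max_attempts}] + {' '.join(cmd)}")
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(delay_s)
        raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")

    exec_with_retries(["conda", "install", "-n", "build_binary", "-c", "conda-forge",
                       "--override-channels", "-y", "pytest", "expecttest"])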
2025-05-07T20:30:45.0318503Z [TEST] Setting feature flags ...
2025-05-07T20:30:45.0319023Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
2025-05-07T20:30:45.4547020Z
2025-05-07T20:30:45.4547556Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning
2025-05-07T20:30:45.4548714Z ################################################################################
2025-05-07T20:30:45.4549039Z # Run FBGEMM-GPU Tests:
2025-05-07T20:30:45.4549310Z #
2025-05-07T20:30:45.4568600Z # [2025-05-07T20:30:45.456Z] + __run_fbgemm_gpu_tests_in_directory build_binary
2025-05-07T20:30:45.4569030Z ################################################################################
2025-05-07T20:30:45.4569263Z
2025-05-07T20:30:45.4576793Z [TEST] Enumerating ALL test files ...
2025-05-07T20:30:45.4605892Z ./attention/gqa_test.py
2025-05-07T20:30:45.4606185Z ./coalesce/coalesce_test.py
2025-05-07T20:30:45.4606458Z ./comm/multi_gpu_car_test.py
2025-05-07T20:30:45.4606750Z ./gather_scatter/gather_scatter_test.py
2025-05-07T20:30:45.4607060Z ./kv_cache/kv_cache_test.py
2025-05-07T20:30:45.4607323Z ./moe/activation_test.py
2025-05-07T20:30:45.4607584Z ./moe/gather_scatter_test.py
2025-05-07T20:30:45.4607847Z ./moe/layers_test.py
2025-05-07T20:30:45.4608083Z ./moe/shuffling_test.py
2025-05-07T20:30:45.4608340Z ./quantize/quantize_test.py
2025-05-07T20:30:45.4608516Z
2025-05-07T20:30:45.4608637Z [TEST] Enumerating IGNORED test files ...
2025-05-07T20:30:45.4608865Z
2025-05-07T20:30:45.4626538Z ################################################################################
2025-05-07T20:30:45.4641612Z # [2025-05-07T20:30:45.463Z] Run Python Test Suite:
2025-05-07T20:30:45.4641956Z # ./attention/gqa_test.py
2025-05-07T20:30:45.4642245Z ################################################################################
2025-05-07T20:30:45.4665584Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py
2025-05-07T20:30:45.4666198Z
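[NOTE] What follows is one pytest session per enumerated test file, each launched inside the build_binary conda environment with the shared argument list above. A rough reconstruction of that loop; run_test_suites is a hypothetical stand-in for __run_fbgemm_gpu_tests_in_directory:

    import pathlib
    import subprocess

    PYTEST_ARGS = ["-v", "-rsx", "-s",
                   "-W", "ignore::pytest.PytestCollectionWarning", "--cache-clear"]

    def run_test_suites(env: str, root: str = ".") -> None:
        # Enumerate every *_test.py under the test directory and run each
        # file as its own pytest session, as the per-suite banners below show.
        for test_file in sorted(pathlib.Path(root).rglob("*_test.py")):
            cmd = ["conda", "run", "--no-capture-output", "-n", env,
                   "python", "-m", "pytest", *PYTEST_ARGS, str(test_file)]
            subprocess.run(cmd, check=True)

    run_test_suites("build_binary")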
2025-05-07T20:30:48.0132787Z ============================= test session starts ==============================
2025-05-07T20:30:48.0133565Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:48.0134172Z cachedir: .pytest_cache
2025-05-07T20:30:48.0135371Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:48.0137388Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:48.0138242Z plugins: hypothesis-6.131.14
2025-05-07T20:30:49.6283238Z collecting ... collected 2 items
2025-05-07T20:30:49.6283686Z
2025-05-07T20:31:27.8591218Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
[... several dozen further Hypothesis-generated examples elided: int4_kv in {False, True}, num_groups in {1, 4}, B and MAX_T up to 117, N_H_L up to 126; the self= object reprs were stripped by the log capture ...]
2025-05-07T20:31:27.8689967Z PASSED
2025-05-07T20:31:27.8774795Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
2025-05-07T20:31:27.8775138Z
2025-05-07T20:31:27.8775297Z =========================== short test summary info ============================
2025-05-07T20:31:27.8776425Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available
2025-05-07T20:31:27.8777887Z ======================== 1 passed, 1 skipped in 40.38s =========================
2025-05-07T20:31:28.5397795Z
2025-05-07T20:31:28.5399834Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:31:28.5419751Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds
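[NOTE] The session header above reports hypothesis profile 'ci' with database=None, deadline=None, print_blob=True, derandomize=True, and suppress_health_check=(HealthCheck.too_slow,). A profile like that is registered through Hypothesis' public settings API; where FBGEMM registers it (likely a conftest.py) is not visible in this log:

    from hypothesis import HealthCheck, settings

    # Deterministic CI profile matching the one reported by pytest above.
    settings.register_profile(
        "ci",
        database=None,                                  # do not persist failing examples
        deadline=None,                                  # no per-example time limit
        print_blob=True,                                # print reproduction blobs on failure
        derandomize=True,                               # stable example order across runs
        suppress_health_check=(HealthCheck.too_slow,),  # allow slow GPU examples
    )
    settings.load_profile("ci")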
2025-05-07T20:31:28.5420050Z
2025-05-07T20:31:28.5443519Z ################################################################################
2025-05-07T20:31:28.5459320Z # [2025-05-07T20:31:28.545Z] Run Python Test Suite:
2025-05-07T20:31:28.5459675Z # ./coalesce/coalesce_test.py
2025-05-07T20:31:28.5459980Z ################################################################################
2025-05-07T20:31:28.5483889Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:31:28.5484538Z
2025-05-07T20:31:30.7038570Z ============================= test session starts ==============================
2025-05-07T20:31:30.7041213Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:30.7041762Z cachedir: .pytest_cache
2025-05-07T20:31:30.7042377Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:30.7043147Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:30.7043583Z plugins: hypothesis-6.131.14
2025-05-07T20:31:32.2442593Z collecting ... collected 1 item
2025-05-07T20:31:32.2442872Z
2025-05-07T20:31:32.9926914Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:31:32.9927261Z
2025-05-07T20:31:32.9927438Z ============================== 1 passed in 2.42s ===============================
2025-05-07T20:31:33.6353613Z
2025-05-07T20:31:33.6354138Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:31:33.6374400Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:31:33.6374841Z
2025-05-07T20:31:33.6395209Z ################################################################################
2025-05-07T20:31:33.6410863Z # [2025-05-07T20:31:33.640Z] Run Python Test Suite:
2025-05-07T20:31:33.6411331Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:31:33.6411638Z ################################################################################
2025-05-07T20:31:33.6437251Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:31:33.6437887Z
2025-05-07T20:31:35.7987867Z ============================= test session starts ==============================
2025-05-07T20:31:35.7989405Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:35.7990519Z cachedir: .pytest_cache
2025-05-07T20:31:35.7991739Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:35.7993239Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:35.7994084Z plugins: hypothesis-6.131.14
2025-05-07T20:31:37.4198413Z collecting ... collected 5 items
2025-05-07T20:31:37.4198722Z
2025-05-07T20:31:37.4209035Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:37.4216921Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:37.4223607Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:37.4234542Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:37.4249528Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:37.4250033Z
2025-05-07T20:31:37.4250648Z =========================== short test summary info ============================
2025-05-07T20:31:37.4251386Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4252363Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4253334Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4254303Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4255267Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4256108Z ============================== 5 skipped in 1.76s ==============================
2025-05-07T20:31:38.0114182Z
2025-05-07T20:31:38.0114853Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:38.0134405Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds
2025-05-07T20:31:38.0134809Z
2025-05-07T20:31:38.0155236Z ################################################################################
2025-05-07T20:31:38.0172190Z # [2025-05-07T20:31:38.016Z] Run Python Test Suite:
2025-05-07T20:31:38.0172548Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:38.0172871Z ################################################################################
2025-05-07T20:31:38.0198636Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:38.0199341Z
2025-05-07T20:31:40.1757287Z ============================= test session starts ==============================
2025-05-07T20:31:40.1757955Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:40.1758508Z cachedir: .pytest_cache
2025-05-07T20:31:40.1759124Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:40.1759890Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:40.1760433Z plugins: hypothesis-6.131.14
2025-05-07T20:31:41.8168659Z collecting ... collected 2 items
2025-05-07T20:31:41.8169109Z
2025-05-07T20:31:41.8178294Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:31:41.8192731Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:31:41.8193385Z
2025-05-07T20:31:41.8193557Z =========================== short test summary info ============================
2025-05-07T20:31:41.8194210Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:41.8195082Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:41.8195709Z ============================== 2 skipped in 1.78s ==============================
2025-05-07T20:31:42.4196060Z
2025-05-07T20:31:42.4196797Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:42.4217719Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds
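[NOTE] The skip reasons above gate tests on GPU topology and generation ("at least two GPUs", "only for Hopper GPU"). Conditions like these are commonly expressed as unittest.skipIf decorators; the exact decorators used by the test files are not shown in the log, so the following is a sketch:

    import unittest
    import torch

    def has_hopper_gpu() -> bool:
        # Hopper-class GPUs (e.g. H100) report compute capability (9, 0).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

    multi_gpu_only = unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.device_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )

    hopper_only = unittest.skipIf(
        not has_hopper_gpu(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )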
2025-05-07T20:31:42.4218062Z
2025-05-07T20:31:42.4238476Z ################################################################################
2025-05-07T20:31:42.4254131Z # [2025-05-07T20:31:42.425Z] Run Python Test Suite:
2025-05-07T20:31:42.4254825Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:31:42.4255133Z ################################################################################
2025-05-07T20:31:42.4280388Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:31:42.4281027Z
2025-05-07T20:31:44.5816832Z ============================= test session starts ==============================
2025-05-07T20:31:44.5817491Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:44.5818032Z cachedir: .pytest_cache
2025-05-07T20:31:44.5818639Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:44.5819402Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:44.5820196Z plugins: hypothesis-6.131.14
2025-05-07T20:31:46.1636909Z collecting ... collected 4 items
2025-05-07T20:31:46.1637236Z
2025-05-07T20:31:48.4147114Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:31:48.4228966Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:31:48.4318558Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:31:48.4405677Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:31:48.4406048Z
2025-05-07T20:31:48.4406211Z =========================== short test summary info ============================
2025-05-07T20:31:48.4406935Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available
2025-05-07T20:31:48.4407892Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available
2025-05-07T20:31:48.4408566Z ============================== 4 skipped in 3.99s ==============================
2025-05-07T20:31:50.7160008Z
2025-05-07T20:31:50.7160994Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:31:50.7181226Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds
2025-05-07T20:31:50.7181628Z
2025-05-07T20:31:50.7203646Z ################################################################################
2025-05-07T20:31:50.7223317Z # [2025-05-07T20:31:50.722Z] Run Python Test Suite:
2025-05-07T20:31:50.7223735Z # ./moe/activation_test.py
2025-05-07T20:31:50.7224058Z ################################################################################
2025-05-07T20:31:50.7248171Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:50.7248827Z
2025-05-07T20:31:52.8772534Z ============================= test session starts ==============================
2025-05-07T20:31:52.8773330Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:52.8773870Z cachedir: .pytest_cache
2025-05-07T20:31:52.8774486Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:52.8775254Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:52.8775690Z plugins: hypothesis-6.131.14
2025-05-07T20:31:54.4646121Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:54.5615099Z collecting ... collected 2 items
collected 2 items 2025-05-07T20:31:54.5615417Z 2025-05-07T20:31:59.4786600Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:59.4787376Z self=, 2025-05-07T20:31:59.4788233Z T=1, 2025-05-07T20:31:59.4788528Z D=5120, 2025-05-07T20:31:59.4788828Z contiguous=True, 2025-05-07T20:31:59.4789088Z compiled=True, 2025-05-07T20:31:59.4789374Z ) 2025-05-07T20:31:59.4789676Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4790263Z self=, 2025-05-07T20:31:59.4790809Z T=4096, 2025-05-07T20:31:59.4791086Z D=5120, 2025-05-07T20:31:59.4791296Z contiguous=True, 2025-05-07T20:31:59.4791524Z compiled=True, 2025-05-07T20:31:59.4791738Z ) 2025-05-07T20:31:59.4791943Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4792328Z self=, 2025-05-07T20:31:59.4792723Z T=4096, 2025-05-07T20:31:59.4792918Z D=7168, 2025-05-07T20:31:59.4793128Z contiguous=False, 2025-05-07T20:31:59.4793359Z compiled=False, 2025-05-07T20:31:59.4793797Z ) 2025-05-07T20:31:59.4794100Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4794680Z self=, 2025-05-07T20:31:59.4795237Z T=4096, 2025-05-07T20:31:59.4795504Z D=5120, 2025-05-07T20:31:59.4795704Z contiguous=False, 2025-05-07T20:31:59.4795941Z compiled=True, 2025-05-07T20:31:59.4796154Z ) 2025-05-07T20:31:59.4796355Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4796748Z self=, 2025-05-07T20:31:59.4797142Z T=1, 2025-05-07T20:31:59.4797327Z D=7168, 2025-05-07T20:31:59.4797531Z contiguous=True, 2025-05-07T20:31:59.4797764Z compiled=True, 2025-05-07T20:31:59.4797971Z ) 2025-05-07T20:31:59.4798179Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4798571Z self=, 2025-05-07T20:31:59.4798957Z T=1, 2025-05-07T20:31:59.4799150Z D=7168, 2025-05-07T20:31:59.4799362Z contiguous=False, 2025-05-07T20:31:59.4799599Z compiled=True, 2025-05-07T20:31:59.4799814Z ) 2025-05-07T20:31:59.4800026Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4800510Z self=, 2025-05-07T20:31:59.4800899Z T=4096, 2025-05-07T20:31:59.4801132Z D=5120, 2025-05-07T20:31:59.4801338Z contiguous=False, 2025-05-07T20:31:59.4801580Z compiled=False, 2025-05-07T20:31:59.4801790Z ) 2025-05-07T20:31:59.4802000Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4802393Z self=, 2025-05-07T20:31:59.4802786Z T=1, 2025-05-07T20:31:59.4802973Z D=7168, 2025-05-07T20:31:59.4803178Z contiguous=True, 2025-05-07T20:31:59.4803420Z compiled=False, 2025-05-07T20:31:59.4803633Z ) 2025-05-07T20:31:59.4803839Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4804233Z self=, 2025-05-07T20:31:59.4804627Z T=2048, 2025-05-07T20:31:59.4804823Z D=5120, 2025-05-07T20:31:59.4805044Z contiguous=True, 2025-05-07T20:31:59.4805304Z compiled=True, 2025-05-07T20:31:59.4805517Z ) 2025-05-07T20:31:59.4805722Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4806106Z self=, 2025-05-07T20:31:59.4806501Z T=2048, 2025-05-07T20:31:59.4806697Z D=7168, 2025-05-07T20:31:59.4806894Z contiguous=True, 2025-05-07T20:31:59.4807127Z compiled=True, 2025-05-07T20:31:59.4807341Z ) 2025-05-07T20:31:59.4807541Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4807932Z self=, 2025-05-07T20:31:59.4808326Z T=2048, 2025-05-07T20:31:59.4808521Z D=7168, 2025-05-07T20:31:59.4808718Z contiguous=True, 2025-05-07T20:31:59.4808952Z compiled=False, 2025-05-07T20:31:59.4809164Z ) 2025-05-07T20:31:59.4809364Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4809762Z self=, 2025-05-07T20:31:59.4810281Z T=128, 2025-05-07T20:31:59.4810473Z D=5120, 2025-05-07T20:31:59.4810675Z contiguous=False, 2025-05-07T20:31:59.4810912Z 
compiled=True, 2025-05-07T20:31:59.4811119Z ) 2025-05-07T20:31:59.4811327Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4811722Z self=, 2025-05-07T20:31:59.4812109Z T=128, 2025-05-07T20:31:59.4812305Z D=5120, 2025-05-07T20:31:59.4812511Z contiguous=True, 2025-05-07T20:31:59.4812739Z compiled=True, 2025-05-07T20:31:59.4812952Z ) 2025-05-07T20:31:59.4813161Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4813949Z self=, 2025-05-07T20:31:59.4814393Z T=16384, 2025-05-07T20:31:59.4814598Z D=5120, 2025-05-07T20:31:59.4814798Z contiguous=False, 2025-05-07T20:31:59.4815036Z compiled=True, 2025-05-07T20:31:59.4815417Z ) 2025-05-07T20:31:59.4815618Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4816019Z self=, 2025-05-07T20:31:59.4816420Z T=16384, 2025-05-07T20:31:59.4816622Z D=5120, 2025-05-07T20:31:59.4816823Z contiguous=False, 2025-05-07T20:31:59.4817060Z compiled=False, 2025-05-07T20:31:59.4817278Z ) 2025-05-07T20:31:59.4817479Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4817871Z self=, 2025-05-07T20:31:59.4818266Z T=128, 2025-05-07T20:31:59.4818455Z D=7168, 2025-05-07T20:31:59.4818662Z contiguous=True, 2025-05-07T20:31:59.4818899Z compiled=False, 2025-05-07T20:31:59.4819114Z ) 2025-05-07T20:31:59.4819314Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4819706Z self=, 2025-05-07T20:31:59.4820100Z T=128, 2025-05-07T20:31:59.4820289Z D=7168, 2025-05-07T20:31:59.4820500Z contiguous=False, 2025-05-07T20:31:59.4820736Z compiled=False, 2025-05-07T20:31:59.4820948Z ) 2025-05-07T20:31:59.4821153Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4821540Z self=, 2025-05-07T20:31:59.4821926Z T=1, 2025-05-07T20:31:59.4822116Z D=5120, 2025-05-07T20:31:59.4822319Z contiguous=False, 2025-05-07T20:31:59.4822547Z compiled=False, 2025-05-07T20:31:59.4822759Z ) 2025-05-07T20:31:59.4822962Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4823345Z self=, 2025-05-07T20:31:59.4823737Z T=1, 2025-05-07T20:31:59.4823930Z D=7168, 2025-05-07T20:31:59.4824127Z contiguous=False, 2025-05-07T20:31:59.4824362Z compiled=False, 2025-05-07T20:31:59.4824576Z ) 2025-05-07T20:31:59.4824791Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4825210Z self=, 2025-05-07T20:31:59.4825632Z T=4096, 2025-05-07T20:31:59.4825828Z D=5120, 2025-05-07T20:31:59.4826038Z contiguous=True, 2025-05-07T20:31:59.4826270Z compiled=False, 2025-05-07T20:31:59.4826487Z ) 2025-05-07T20:31:59.4826744Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4827210Z self=, 2025-05-07T20:31:59.4827864Z T=128, 2025-05-07T20:31:59.4836020Z D=7168, 2025-05-07T20:31:59.4836295Z contiguous=True, 2025-05-07T20:31:59.4836546Z compiled=True, 2025-05-07T20:31:59.4836769Z ) 2025-05-07T20:31:59.4836973Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4837375Z self=, 2025-05-07T20:31:59.4837781Z T=1, 2025-05-07T20:31:59.4837974Z D=5120, 2025-05-07T20:31:59.4838188Z contiguous=False, 2025-05-07T20:31:59.4838427Z compiled=True, 2025-05-07T20:31:59.4838640Z ) 2025-05-07T20:31:59.4838850Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4839259Z self=, 2025-05-07T20:31:59.4839858Z T=4096, 2025-05-07T20:31:59.4840061Z D=7168, 2025-05-07T20:31:59.4840371Z contiguous=True, 2025-05-07T20:31:59.4840605Z compiled=False, 2025-05-07T20:31:59.4840827Z ) 2025-05-07T20:31:59.4841035Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4841420Z self=, 2025-05-07T20:31:59.4841816Z T=4096, 2025-05-07T20:31:59.4842012Z D=7168, 2025-05-07T20:31:59.4842213Z contiguous=False, 2025-05-07T20:31:59.4842452Z compiled=True, 2025-05-07T20:31:59.4842668Z ) 
2025-05-07T20:31:59.4842870Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4843260Z self=, 2025-05-07T20:31:59.4843656Z T=128, 2025-05-07T20:31:59.4843851Z D=5120, 2025-05-07T20:31:59.4844049Z contiguous=True, 2025-05-07T20:31:59.4844283Z compiled=False, 2025-05-07T20:31:59.4844594Z ) 2025-05-07T20:31:59.4844793Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4845191Z self=, 2025-05-07T20:31:59.4845585Z T=128, 2025-05-07T20:31:59.4845773Z D=5120, 2025-05-07T20:31:59.4845980Z contiguous=False, 2025-05-07T20:31:59.4846216Z compiled=False, 2025-05-07T20:31:59.4846423Z ) 2025-05-07T20:31:59.4846631Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4847022Z self=, 2025-05-07T20:31:59.4847407Z T=1, 2025-05-07T20:31:59.4847607Z D=5120, 2025-05-07T20:31:59.4847811Z contiguous=True, 2025-05-07T20:31:59.4848036Z compiled=False, 2025-05-07T20:31:59.4848252Z ) 2025-05-07T20:31:59.4848456Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4848840Z self=, 2025-05-07T20:31:59.4849238Z T=2048, 2025-05-07T20:31:59.4849432Z D=7168, 2025-05-07T20:31:59.4849647Z contiguous=False, 2025-05-07T20:31:59.4849879Z compiled=True, 2025-05-07T20:31:59.4850098Z ) 2025-05-07T20:31:59.4850308Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4850690Z self=, 2025-05-07T20:31:59.4851083Z T=2048, 2025-05-07T20:31:59.4851283Z D=7168, 2025-05-07T20:31:59.4851482Z contiguous=False, 2025-05-07T20:31:59.4851721Z compiled=False, 2025-05-07T20:31:59.4851941Z ) 2025-05-07T20:31:59.4852139Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4852531Z self=, 2025-05-07T20:31:59.4852930Z T=16384, 2025-05-07T20:31:59.4853126Z D=7168, 2025-05-07T20:31:59.4853332Z contiguous=False, 2025-05-07T20:31:59.4853575Z compiled=True, 2025-05-07T20:31:59.4853784Z ) 2025-05-07T20:31:59.4853996Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4854388Z self=, 2025-05-07T20:31:59.4854784Z T=16384, 2025-05-07T20:31:59.4854987Z D=7168, 2025-05-07T20:31:59.4855198Z contiguous=True, 2025-05-07T20:31:59.4855422Z compiled=True, 2025-05-07T20:31:59.4855636Z ) 2025-05-07T20:31:59.4855843Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4856232Z self=, 2025-05-07T20:31:59.4856618Z T=4096, 2025-05-07T20:31:59.4856813Z D=7168, 2025-05-07T20:31:59.4857016Z contiguous=True, 2025-05-07T20:31:59.4857242Z compiled=True, 2025-05-07T20:31:59.4857457Z ) 2025-05-07T20:31:59.4857666Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4858051Z self=, 2025-05-07T20:31:59.4858450Z T=2048, 2025-05-07T20:31:59.4858643Z D=5120, 2025-05-07T20:31:59.4858841Z contiguous=False, 2025-05-07T20:31:59.4859078Z compiled=False, 2025-05-07T20:31:59.4859293Z ) 2025-05-07T20:31:59.4859491Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4859982Z self=, 2025-05-07T20:31:59.4860383Z T=2048, 2025-05-07T20:31:59.4860575Z D=5120, 2025-05-07T20:31:59.4860776Z contiguous=True, 2025-05-07T20:31:59.4861011Z compiled=False, 2025-05-07T20:31:59.4861218Z ) 2025-05-07T20:31:59.4861426Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4861813Z self=, 2025-05-07T20:31:59.4862206Z T=128, 2025-05-07T20:31:59.4862396Z D=7168, 2025-05-07T20:31:59.4862602Z contiguous=False, 2025-05-07T20:31:59.4862842Z compiled=True, 2025-05-07T20:31:59.4863048Z ) 2025-05-07T20:31:59.4863254Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4863642Z self=, 2025-05-07T20:31:59.4864033Z T=16384, 2025-05-07T20:31:59.4864238Z D=5120, 2025-05-07T20:31:59.4864443Z contiguous=True, 2025-05-07T20:31:59.4864671Z compiled=True, 2025-05-07T20:31:59.4864972Z ) 2025-05-07T20:31:59.4865209Z Trying example: 
test_silu_mul( 2025-05-07T20:31:59.4865620Z self=, 2025-05-07T20:31:59.4866015Z T=2048, 2025-05-07T20:31:59.4866216Z D=5120, 2025-05-07T20:31:59.4866418Z contiguous=False, 2025-05-07T20:31:59.4866657Z compiled=True, 2025-05-07T20:31:59.4866870Z ) 2025-05-07T20:31:59.4867075Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4867468Z self=, 2025-05-07T20:31:59.4867864Z T=16384, 2025-05-07T20:31:59.4868060Z D=5120, 2025-05-07T20:31:59.4868267Z contiguous=True, 2025-05-07T20:31:59.4868501Z compiled=False, 2025-05-07T20:31:59.4868707Z ) 2025-05-07T20:31:59.4868917Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4869308Z self=, 2025-05-07T20:31:59.4869700Z T=16384, 2025-05-07T20:31:59.4869900Z D=7168, 2025-05-07T20:31:59.4870106Z contiguous=False, 2025-05-07T20:31:59.4870347Z compiled=False, 2025-05-07T20:31:59.4870555Z ) 2025-05-07T20:31:59.4870762Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4871151Z self=, 2025-05-07T20:31:59.4871535Z T=16384, 2025-05-07T20:31:59.4871737Z D=7168, 2025-05-07T20:31:59.4871940Z contiguous=True, 2025-05-07T20:31:59.4872166Z compiled=False, 2025-05-07T20:31:59.4872378Z ) 2025-05-07T20:31:59.4872569Z PASSED 2025-05-07T20:31:59.5492607Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.5493887Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:59.5495312Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.5496846Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.5497881Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.5499247Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.5500693Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.5502411Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.5503858Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.5504960Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] module_map=module_map) 2025-05-07T20:31:59.5506284Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, 
in ast_to_ttir 2025-05-07T20:31:59.5507735Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:59.5508619Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:59.5509881Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.5511146Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:59.5512233Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:59.5513297Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:59.5514959Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.5516306Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.5517243Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:59.5518379Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:59.5519467Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:59.5520400Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:59.5521633Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.5523042Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.5524152Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.5525104Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.5526041Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:59.5527114Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
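Note: every warning and failure in this job traces to a single root cause. Triton's fp8e4nv is the e4m3 format that NVIDIA GPUs expose natively only at compute capability 8.9 and newer (Ada/Hopper); the A10G behind linux.g5.4xlarge is sm_86, where Triton offers only fp8e4b15 and fp8e5, hence the repeated ValueError. A minimal sketch of a capability gate that would skip these cases on unsupported GPUs (the helper name and skip message are illustrative, not FBGEMM API):

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) needs an NVIDIA GPU with compute
    # capability >= 8.9; the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing cases below, e.g.:
#
#   @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
#   def test_silu_mul_quant(self, ...) -> None: ...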
2025-05-07T20:32:00.0091661Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.0092633Z self=,
2025-05-07T20:32:00.0093207Z T=1,
2025-05-07T20:32:00.0093407Z D=5120,
2025-05-07T20:32:00.0093625Z scale_ub=None,
2025-05-07T20:32:00.0093846Z contiguous=True,
2025-05-07T20:32:00.0094081Z compiled=True,
2025-05-07T20:32:00.0094298Z )
2025-05-07T20:32:00.0094630Z self = 
2025-05-07T20:32:00.0095139Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:00.0095408Z 
2025-05-07T20:32:00.0095497Z @given(
2025-05-07T20:32:00.0095734Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.0096065Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.0096389Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.0096731Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.0097079Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.0097380Z )
2025-05-07T20:32:00.0097751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.0098219Z def test_silu_mul_quant(
2025-05-07T20:32:00.0098478Z     self,
2025-05-07T20:32:00.0098687Z     T: int,
2025-05-07T20:32:00.0098892Z     D: int,
2025-05-07T20:32:00.0099125Z     scale_ub: Optional[float],
2025-05-07T20:32:00.0099412Z     contiguous: bool,
2025-05-07T20:32:00.0099662Z     compiled: bool,
2025-05-07T20:32:00.0099900Z ) -> None:
2025-05-07T20:32:00.0100131Z     torch.manual_seed(2025)
2025-05-07T20:32:00.0100382Z 
2025-05-07T20:32:00.0100672Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.0101032Z 
2025-05-07T20:32:00.0101234Z     x_sign = torch.sign(x)
2025-05-07T20:32:00.0101543Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.0101874Z     x = x_sign * x_clamp
2025-05-07T20:32:00.0102130Z     x0 = x[:, :D]
2025-05-07T20:32:00.0102355Z     x1 = x[:, D:]
2025-05-07T20:32:00.0102576Z 
2025-05-07T20:32:00.0102776Z     if contiguous:
2025-05-07T20:32:00.0103415Z         x0 = x0.contiguous()
2025-05-07T20:32:00.0103693Z         x1 = x1.contiguous()
2025-05-07T20:32:00.0103946Z 
2025-05-07T20:32:00.0104143Z     if scale_ub is not None:
2025-05-07T20:32:00.0104432Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.0104791Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.0105116Z         )
2025-05-07T20:32:00.0105344Z     else:
2025-05-07T20:32:00.0105593Z         scale_ub_tensor = None
2025-05-07T20:32:00.0105854Z 
2025-05-07T20:32:00.0106100Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.0106462Z         op = silu_mul_quant
2025-05-07T20:32:00.0106729Z         if compiled:
2025-05-07T20:32:00.0106994Z             op = torch.compile(op)
2025-05-07T20:32:00.0107303Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.0107598Z 
2025-05-07T20:32:00.0107989Z     y_fp8, y_scale = fn()
2025-05-07T20:32:00.0108292Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.0108601Z 
2025-05-07T20:32:00.0108851Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.0109198Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.0109510Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.0109843Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.0110222Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.0110544Z 
2025-05-07T20:32:00.0110759Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.0110964Z 
2025-05-07T20:32:00.0111076Z moe/activation_test.py:126:
2025-05-07T20:32:00.0111388Z _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0111747Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.0112096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.0112941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.0114069Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.0114653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0115378Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0116099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.0116865Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.0117639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.0118325Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.0118964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.0119511Z fn() 2025-05-07T20:32:00.0120054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.0120821Z self.fn.run( 2025-05-07T20:32:00.0121362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0121925Z kernel = self.compile( 2025-05-07T20:32:00.0122497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0123183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0123606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0123849Z 2025-05-07T20:32:00.0124074Z self = 2025-05-07T20:32:00.0125365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0126833Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f17492f20>} 2025-05-07T20:32:00.0128243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0129316Z context = 2025-05-07T20:32:00.0129618Z 2025-05-07T20:32:00.0129801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0130353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0130966Z module_map=module_map) 2025-05-07T20:32:00.0131358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0131736Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.0132016Z E ^ 2025-05-07T20:32:00.0132506Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.0133000Z 2025-05-07T20:32:00.0133440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.0133986Z 2025-05-07T20:32:00.0134097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0134541Z self=, 2025-05-07T20:32:00.0134962Z T=2048, 2025-05-07T20:32:00.0135169Z D=5120, 2025-05-07T20:32:00.0135377Z scale_ub=1200.0, 2025-05-07T20:32:00.0135612Z contiguous=True, 2025-05-07T20:32:00.0135859Z compiled=False, 2025-05-07T20:32:00.0136077Z ) 2025-05-07T20:32:00.0144797Z self = 2025-05-07T20:32:00.0145444Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.0145770Z 2025-05-07T20:32:00.0145854Z @given( 2025-05-07T20:32:00.0146114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0146471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0146819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0147189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0147564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0147890Z ) 2025-05-07T20:32:00.0148291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0148819Z def test_silu_mul_quant( 2025-05-07T20:32:00.0149092Z self, 2025-05-07T20:32:00.0149305Z T: int, 2025-05-07T20:32:00.0149520Z D: int, 2025-05-07T20:32:00.0149762Z scale_ub: Optional[float], 2025-05-07T20:32:00.0150060Z contiguous: bool, 2025-05-07T20:32:00.0150329Z compiled: bool, 2025-05-07T20:32:00.0150577Z ) -> None: 2025-05-07T20:32:00.0150809Z torch.manual_seed(2025) 2025-05-07T20:32:00.0151082Z 2025-05-07T20:32:00.0151389Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0151788Z 2025-05-07T20:32:00.0151985Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0152292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0152615Z x = x_sign * x_clamp 2025-05-07T20:32:00.0152860Z x0 = x[:, :D] 2025-05-07T20:32:00.0153089Z x1 = x[:, D:] 2025-05-07T20:32:00.0153307Z 2025-05-07T20:32:00.0153495Z if contiguous: 2025-05-07T20:32:00.0153744Z x0 = x0.contiguous() 2025-05-07T20:32:00.0154018Z x1 = x1.contiguous() 2025-05-07T20:32:00.0154268Z 2025-05-07T20:32:00.0154589Z if scale_ub is not None: 2025-05-07T20:32:00.0154882Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.0155232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.0155608Z ) 2025-05-07T20:32:00.0155809Z else: 2025-05-07T20:32:00.0156024Z scale_ub_tensor = None 2025-05-07T20:32:00.0156287Z 2025-05-07T20:32:00.0156531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0156854Z op = silu_mul_quant 2025-05-07T20:32:00.0157115Z if compiled: 2025-05-07T20:32:00.0157373Z op = torch.compile(op) 2025-05-07T20:32:00.0157682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0157962Z 2025-05-07T20:32:00.0158163Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.0158332Z 2025-05-07T20:32:00.0158441Z moe/activation_test.py:117: 2025-05-07T20:32:00.0158743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0159187Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.0159485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0160288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.0161013Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.0161577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0162293Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0162984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0163542Z kernel = self.compile( 2025-05-07T20:32:00.0164114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0164813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0165227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0165473Z 2025-05-07T20:32:00.0165688Z self = 2025-05-07T20:32:00.0166810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0168239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f289bfec0>} 2025-05-07T20:32:00.0169626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0170698Z context = 2025-05-07T20:32:00.0171006Z 2025-05-07T20:32:00.0171181Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0171732Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0172220Z module_map=module_map) 2025-05-07T20:32:00.0172610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0172981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.0173246Z E ^ 2025-05-07T20:32:00.0173732Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.0174203Z 2025-05-07T20:32:00.0174634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2738915Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:00.2740066Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:00.2741625Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:00.2743236Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:00.2744256Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.2745814Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:00.2747253Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2748613Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:00.2750047Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2751149Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:00.2752474Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:00.2753770Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:00.2754651Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:00.2755913Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:00.2757180Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:00.2758275Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:00.2759347Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:00.2760737Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:00.2762079Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:00.2763107Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:00.2764255Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:00.2765344Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:00.2766156Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:00.2767379Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:00.2768786Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:00.2769981Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2770944Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2771725Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:00.2772797Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
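The same CompilationError reproduces outside FBGEMM with any @triton.jit kernel that casts to tl.float8e4nv on this GPU; a hypothetical minimal kernel (an assumption for illustration, not FBGEMM code) fails at the same src.make_ir stage:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # On pre-sm_89 GPUs this cast is rejected at compile time with
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

# x = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
# y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
# _cast_to_fp8e4nv[(triton.cdiv(x.numel(), 1024),)](x, y, x.numel(), BLOCK=1024)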
2025-05-07T20:32:00.8650036Z 
2025-05-07T20:32:00.8650459Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.8651300Z self=,
2025-05-07T20:32:00.8651896Z T=2048,
2025-05-07T20:32:00.8652148Z D=5120,
2025-05-07T20:32:00.8652349Z scale_ub=1200.0,
2025-05-07T20:32:00.8652587Z contiguous=True,
2025-05-07T20:32:00.8652824Z compiled=True,
2025-05-07T20:32:00.8653043Z )
2025-05-07T20:32:00.8653389Z self = 
2025-05-07T20:32:00.8653915Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:00.8654198Z 
2025-05-07T20:32:00.8654284Z @given(
2025-05-07T20:32:00.8654533Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.8654892Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.8655609Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.8655959Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.8656311Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.8656613Z )
2025-05-07T20:32:00.8656976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.8657441Z def test_silu_mul_quant(
2025-05-07T20:32:00.8657695Z     self,
2025-05-07T20:32:00.8657896Z     T: int,
2025-05-07T20:32:00.8658105Z     D: int,
2025-05-07T20:32:00.8658334Z     scale_ub: Optional[float],
2025-05-07T20:32:00.8658614Z     contiguous: bool,
2025-05-07T20:32:00.8658869Z     compiled: bool,
2025-05-07T20:32:00.8659108Z ) -> None:
2025-05-07T20:32:00.8659329Z     torch.manual_seed(2025)
2025-05-07T20:32:00.8659584Z 
2025-05-07T20:32:00.8659875Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.8660387Z 
2025-05-07T20:32:00.8660593Z     x_sign = torch.sign(x)
2025-05-07T20:32:00.8660901Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.8661229Z     x = x_sign * x_clamp
2025-05-07T20:32:00.8661477Z     x0 = x[:, :D]
2025-05-07T20:32:00.8661706Z     x1 = x[:, D:]
2025-05-07T20:32:00.8661931Z 
2025-05-07T20:32:00.8662122Z     if contiguous:
2025-05-07T20:32:00.8662369Z         x0 = x0.contiguous()
2025-05-07T20:32:00.8662642Z         x1 = x1.contiguous()
2025-05-07T20:32:00.8662889Z 
2025-05-07T20:32:00.8663097Z     if scale_ub is not None:
2025-05-07T20:32:00.8663388Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.8663737Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.8664067Z         )
2025-05-07T20:32:00.8664274Z     else:
2025-05-07T20:32:00.8664493Z         scale_ub_tensor = None
2025-05-07T20:32:00.8664771Z 
2025-05-07T20:32:00.8665023Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.8665360Z         op = silu_mul_quant
2025-05-07T20:32:00.8665620Z         if compiled:
2025-05-07T20:32:00.8665915Z             op = torch.compile(op)
2025-05-07T20:32:00.8666230Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.8666525Z 
2025-05-07T20:32:00.8666728Z     y_fp8, y_scale = fn()
2025-05-07T20:32:00.8667029Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.8667340Z 
2025-05-07T20:32:00.8667594Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.8667945Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.8668257Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.8668591Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.8668969Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.8669306Z 
2025-05-07T20:32:00.8669528Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.8669736Z 
2025-05-07T20:32:00.8669846Z moe/activation_test.py:126:
2025-05-07T20:32:00.8670165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.8670520Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.8670871Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.8671696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.8672485Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.8673063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.8673781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.8674508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.8675363Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.8676187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.8676859Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.8677498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.8678050Z fn() 2025-05-07T20:32:00.8678589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.8679199Z self.fn.run( 2025-05-07T20:32:00.8679699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.8680384Z kernel = self.compile( 2025-05-07T20:32:00.8680957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.8681741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.8682164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.8682408Z 2025-05-07T20:32:00.8682633Z self = 2025-05-07T20:32:00.8683762Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.8685214Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f288f7240>} 2025-05-07T20:32:00.8686673Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.8687752Z context = 2025-05-07T20:32:00.8688058Z 2025-05-07T20:32:00.8688244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.8688797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.8689294Z module_map=module_map) 2025-05-07T20:32:00.8689687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.8690061Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.8690348Z E ^ 2025-05-07T20:32:00.8690840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.8691310Z 2025-05-07T20:32:00.8691755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.8692298Z 2025-05-07T20:32:00.8692413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8692860Z self=, 2025-05-07T20:32:00.8693289Z T=16384, 2025-05-07T20:32:00.8693492Z D=7168, 2025-05-07T20:32:00.8693704Z scale_ub=1200.0, 2025-05-07T20:32:00.8693946Z contiguous=False, 2025-05-07T20:32:00.8694184Z compiled=False, 2025-05-07T20:32:00.8694402Z ) 2025-05-07T20:32:00.8694742Z self = 2025-05-07T20:32:00.8695281Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:00.8695582Z 2025-05-07T20:32:00.8695667Z @given( 2025-05-07T20:32:00.8695917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8696256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8696578Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8696935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8697369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8697676Z ) 2025-05-07T20:32:00.8698048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8698517Z def test_silu_mul_quant( 2025-05-07T20:32:00.8698770Z self, 2025-05-07T20:32:00.8698979Z T: int, 2025-05-07T20:32:00.8699188Z D: int, 2025-05-07T20:32:00.8699419Z scale_ub: Optional[float], 2025-05-07T20:32:00.8699701Z contiguous: bool, 2025-05-07T20:32:00.8699959Z compiled: bool, 2025-05-07T20:32:00.8700197Z ) -> None: 2025-05-07T20:32:00.8700423Z torch.manual_seed(2025) 2025-05-07T20:32:00.8700685Z 2025-05-07T20:32:00.8700973Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8701348Z 2025-05-07T20:32:00.8701552Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8701948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8702283Z x = x_sign * x_clamp 2025-05-07T20:32:00.8702532Z x0 = x[:, :D] 2025-05-07T20:32:00.8702767Z x1 = x[:, D:] 2025-05-07T20:32:00.8702990Z 2025-05-07T20:32:00.8703186Z if contiguous: 2025-05-07T20:32:00.8703433Z x0 = x0.contiguous() 2025-05-07T20:32:00.8703708Z x1 = x1.contiguous() 2025-05-07T20:32:00.8703958Z 2025-05-07T20:32:00.8704162Z if scale_ub is not None: 2025-05-07T20:32:00.8704453Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.8704813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.8705136Z ) 2025-05-07T20:32:00.8705345Z else: 2025-05-07T20:32:00.8705605Z scale_ub_tensor = None 2025-05-07T20:32:00.8705875Z 2025-05-07T20:32:00.8706122Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.8706454Z op = silu_mul_quant 2025-05-07T20:32:00.8706718Z if compiled: 2025-05-07T20:32:00.8706986Z op = torch.compile(op) 2025-05-07T20:32:00.8707303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.8707588Z 2025-05-07T20:32:00.8707794Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.8707966Z 2025-05-07T20:32:00.8708077Z moe/activation_test.py:117: 2025-05-07T20:32:00.8708383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.8708736Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.8709037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.8709760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
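Every failure in this run reduces to the same root cause: Triton refuses to compile any kernel that touches the fp8e4nv (FP8 E4M3) dtype on this runner's GPU. The job runs on linux.g5.4xlarge.nvidia.gpu, whose NVIDIA A10G reports compute capability (8, 6), and the supported list printed above, ('fp8e4b15', 'fp8e5'), is consistent with Triton's NVIDIA backend enabling fp8e4nv only from compute capability (8, 9) (Ada/Hopper) upward. A minimal sketch of a capability guard that would let such tests skip cleanly on pre-Ada GPUs; supports_fp8e4nv and Fp8KernelTests are hypothetical names for illustration, not FBGEMM code:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # A10G (g5.4xlarge) reports (8, 6); the assumption here is that
        # Triton's fp8e4nv path needs compute capability (8, 9) or newer,
        # which matches the ValueError above on this runner.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class Fp8KernelTests(unittest.TestCase):
        def test_e4m3_roundtrip(self) -> None:
            # Cast through FP8 E4M3 and back; tolerances sized for a
            # 3-bit mantissa (relative step up to ~6.25%).
            x = torch.randn(8, device="cuda", dtype=torch.bfloat16)
            y = x.to(torch.float8_e4m3fn).to(torch.bfloat16)
            torch.testing.assert_close(x, y, rtol=0.13, atol=0.02)

The same predicate could gate the kernels themselves, falling back to fp8e5 (E5M2) where E4M3 is unavailable, at the cost of one mantissa bit of precision.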
2025-05-07T20:32:00.8710497Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:00.8711062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:00.8711794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:00.8712498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:00.8713067Z kernel = self.compile(
2025-05-07T20:32:00.8713913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:00.8714611Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:00.8715038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:00.8715280Z
2025-05-07T20:32:00.8715505Z self =
2025-05-07T20:32:00.8716635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:00.8718251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164b6e80>}
2025-05-07T20:32:00.8719805Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:00.8729309Z context =
2025-05-07T20:32:00.8729663Z
2025-05-07T20:32:00.8729850Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:00.8730418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:00.8730922Z module_map=module_map)
2025-05-07T20:32:00.8731306Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.8731889Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:00.8732169Z E ^
2025-05-07T20:32:00.8732668Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.8733141Z 2025-05-07T20:32:00.8733583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0487112Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:01.0488352Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:01.0489760Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:01.0491306Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:01.0492331Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.0493705Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:01.0495155Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0496537Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:01.0497988Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0499088Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] module_map=module_map) 2025-05-07T20:32:01.0500408Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:01.0501714Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:01.0502952Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:01.0504223Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:01.0505478Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:01.0506561Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:01.0507630Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:01.0508912Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:01.0510384Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:01.0511322Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:01.0512462Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:01.0513933Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:01.0514757Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:01.0515989Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:01.0517392Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:01.0518498Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0519452Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0520342Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:01.0521411Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9487019Z 2025-05-07T20:32:01.9487394Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9487946Z self=, 2025-05-07T20:32:01.9488556Z T=1, 2025-05-07T20:32:01.9488829Z D=7168, 2025-05-07T20:32:01.9489106Z scale_ub=None, 2025-05-07T20:32:01.9489421Z contiguous=True, 2025-05-07T20:32:01.9490230Z compiled=True, 2025-05-07T20:32:01.9490506Z ) 2025-05-07T20:32:01.9490937Z self = 2025-05-07T20:32:01.9491453Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.9491727Z 2025-05-07T20:32:01.9491814Z @given( 2025-05-07T20:32:01.9492066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9492395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9492710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9493054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9493399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9493692Z ) 2025-05-07T20:32:01.9494060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9494522Z def test_silu_mul_quant( 2025-05-07T20:32:01.9494775Z self, 2025-05-07T20:32:01.9494985Z T: int, 2025-05-07T20:32:01.9495192Z D: int, 2025-05-07T20:32:01.9495425Z scale_ub: Optional[float], 2025-05-07T20:32:01.9495704Z contiguous: bool, 2025-05-07T20:32:01.9495957Z compiled: bool, 2025-05-07T20:32:01.9496196Z ) -> None: 2025-05-07T20:32:01.9496419Z torch.manual_seed(2025) 2025-05-07T20:32:01.9496676Z 2025-05-07T20:32:01.9496962Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9497317Z 2025-05-07T20:32:01.9497522Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9497827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9498147Z x = x_sign * x_clamp 2025-05-07T20:32:01.9498398Z x0 = x[:, :D] 2025-05-07T20:32:01.9498625Z x1 = x[:, D:] 2025-05-07T20:32:01.9498837Z 2025-05-07T20:32:01.9499035Z if contiguous: 2025-05-07T20:32:01.9499277Z x0 = x0.contiguous() 2025-05-07T20:32:01.9499543Z x1 = x1.contiguous() 2025-05-07T20:32:01.9499806Z 2025-05-07T20:32:01.9500008Z if scale_ub is not None: 2025-05-07T20:32:01.9500298Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9500645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9500973Z ) 2025-05-07T20:32:01.9501177Z else: 2025-05-07T20:32:01.9501395Z scale_ub_tensor = None 2025-05-07T20:32:01.9501660Z 2025-05-07T20:32:01.9501908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9502233Z op = silu_mul_quant 2025-05-07T20:32:01.9502500Z if compiled: 2025-05-07T20:32:01.9502794Z op = torch.compile(op) 2025-05-07T20:32:01.9503109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9503398Z 2025-05-07T20:32:01.9503598Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.9503899Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.9504207Z 2025-05-07T20:32:01.9504456Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9504976Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.9505292Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.9505619Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.9505997Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9506325Z 2025-05-07T20:32:01.9506539Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.9506743Z 2025-05-07T20:32:01.9506849Z moe/activation_test.py:126: 2025-05-07T20:32:01.9507162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9507515Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.9507852Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9508680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.9509546Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.9510124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9510834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9511556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9512316Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9513077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9514195Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9514830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9515373Z fn() 2025-05-07T20:32:01.9515911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9516526Z self.fn.run( 2025-05-07T20:32:01.9517017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9517566Z kernel = self.compile( 2025-05-07T20:32:01.9518133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9518816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9519230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9519471Z 2025-05-07T20:32:01.9519689Z self = 2025-05-07T20:32:01.9520913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9522363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164b6700>} 2025-05-07T20:32:01.9523757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9524819Z context = 2025-05-07T20:32:01.9525121Z 2025-05-07T20:32:01.9525297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9525851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9526346Z module_map=module_map) 2025-05-07T20:32:01.9526729Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9527253Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.9527545Z E ^ 2025-05-07T20:32:01.9528036Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9528506Z 2025-05-07T20:32:01.9528941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9529482Z 2025-05-07T20:32:01.9529593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9530031Z self=, 2025-05-07T20:32:01.9530456Z T=4096, 2025-05-07T20:32:01.9530652Z D=5120, 2025-05-07T20:32:01.9530858Z scale_ub=None, 2025-05-07T20:32:01.9531089Z contiguous=False, 2025-05-07T20:32:01.9531324Z compiled=False, 2025-05-07T20:32:01.9531544Z ) 2025-05-07T20:32:01.9531882Z self = 2025-05-07T20:32:01.9532518Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9532812Z 2025-05-07T20:32:01.9532895Z @given( 2025-05-07T20:32:01.9533142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9533463Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9533791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9534141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9534490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9534788Z ) 2025-05-07T20:32:01.9535156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9535619Z def test_silu_mul_quant( 2025-05-07T20:32:01.9535896Z self, 2025-05-07T20:32:01.9536128Z T: int, 2025-05-07T20:32:01.9536336Z D: int, 2025-05-07T20:32:01.9536567Z scale_ub: Optional[float], 2025-05-07T20:32:01.9536865Z contiguous: bool, 2025-05-07T20:32:01.9537119Z compiled: bool, 2025-05-07T20:32:01.9537353Z ) -> None: 2025-05-07T20:32:01.9537583Z torch.manual_seed(2025) 2025-05-07T20:32:01.9537837Z 2025-05-07T20:32:01.9538121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9538487Z 2025-05-07T20:32:01.9538693Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9538997Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9539325Z x = x_sign * x_clamp 2025-05-07T20:32:01.9539575Z x0 = x[:, :D] 2025-05-07T20:32:01.9539802Z x1 = x[:, D:] 2025-05-07T20:32:01.9540016Z 2025-05-07T20:32:01.9540211Z if contiguous: 2025-05-07T20:32:01.9540453Z x0 = x0.contiguous() 2025-05-07T20:32:01.9540718Z x1 = x1.contiguous() 2025-05-07T20:32:01.9540968Z 2025-05-07T20:32:01.9541172Z if scale_ub is not None: 2025-05-07T20:32:01.9541458Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9541825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9542151Z ) 2025-05-07T20:32:01.9542351Z else: 2025-05-07T20:32:01.9542574Z scale_ub_tensor = None 2025-05-07T20:32:01.9542840Z 2025-05-07T20:32:01.9543080Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9543413Z op = silu_mul_quant 2025-05-07T20:32:01.9543679Z if compiled: 2025-05-07T20:32:01.9543934Z op = torch.compile(op) 2025-05-07T20:32:01.9544246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9544539Z 2025-05-07T20:32:01.9544736Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9544914Z 2025-05-07T20:32:01.9545018Z moe/activation_test.py:117: 2025-05-07T20:32:01.9545328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9545682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9545973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9546789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9547517Z 
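The reference path in these examples pins down the contract of triton_quantize_fp8_row: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so the returned scale must be the per-row dequantization multiplier. A plain-PyTorch sketch of that row-wise scheme, under the assumption that scale_ub caps the per-row max before the scale is derived; quantize_fp8_row_ref is an illustrative helper, not the FBGEMM kernel:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, optionally capped by the scale upper bound.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
        scale = row_max / FP8_E4M3_MAX             # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Round trip: y_fp8.to(torch.float32) * scale[:, None] approximates y,
    # matching the dequantization used in test_silu_mul_quant.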
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9548077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9548983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9549680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9550236Z kernel = self.compile( 2025-05-07T20:32:01.9550797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9551492Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9551909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9552235Z 2025-05-07T20:32:01.9552461Z self = 2025-05-07T20:32:01.9553578Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9555004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f28971d00>} 2025-05-07T20:32:01.9556397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9557460Z context = 2025-05-07T20:32:01.9557762Z 2025-05-07T20:32:01.9557943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9558499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9558992Z module_map=module_map) 2025-05-07T20:32:01.9559377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9559745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9560018Z E ^ 2025-05-07T20:32:01.9560580Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9561048Z 2025-05-07T20:32:01.9561486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.2249637Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:02.2250989Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:02.2252409Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:02.2253923Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:02.2254959Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:02.2256667Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:02.2258144Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2259517Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:02.2260974Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2262085Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:32:02.2263584Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:02.2264904Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:02.2265797Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:02.2267071Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:02.2268348Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:02.2269454Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:02.2270545Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:02.2271834Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:02.2273190Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:02.2274150Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:02.2275309Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:02.2276415Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:02.2277232Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:02.2278472Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:02.2279904Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:02.2281127Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2282173Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2282966Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:02.2284050Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8376398Z 2025-05-07T20:32:03.8377102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.8377876Z self=, 2025-05-07T20:32:03.8378564Z T=4096, 2025-05-07T20:32:03.8378926Z D=7168, 2025-05-07T20:32:03.8379239Z scale_ub=None, 2025-05-07T20:32:03.8379591Z contiguous=False, 2025-05-07T20:32:03.8379971Z compiled=False, 2025-05-07T20:32:03.8380307Z ) 2025-05-07T20:32:03.8380853Z self = 2025-05-07T20:32:03.8381713Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.8382187Z 2025-05-07T20:32:03.8382310Z @given( 2025-05-07T20:32:03.8382688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.8383180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.8383668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.8384256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.8384826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.8385321Z ) 2025-05-07T20:32:03.8385901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.8386710Z def test_silu_mul_quant( 2025-05-07T20:32:03.8387106Z self, 2025-05-07T20:32:03.8387414Z T: int, 2025-05-07T20:32:03.8387732Z D: int, 2025-05-07T20:32:03.8388093Z scale_ub: Optional[float], 2025-05-07T20:32:03.8388545Z contiguous: bool, 2025-05-07T20:32:03.8388950Z compiled: bool, 2025-05-07T20:32:03.8389320Z ) -> None: 2025-05-07T20:32:03.8389669Z torch.manual_seed(2025) 2025-05-07T20:32:03.8390079Z 2025-05-07T20:32:03.8390542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.8391118Z 2025-05-07T20:32:03.8391436Z x_sign = torch.sign(x) 2025-05-07T20:32:03.8391921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.8392865Z x = x_sign * x_clamp 2025-05-07T20:32:03.8393273Z x0 = x[:, :D] 2025-05-07T20:32:03.8393630Z x1 = x[:, D:] 2025-05-07T20:32:03.8393975Z 2025-05-07T20:32:03.8394270Z if contiguous: 2025-05-07T20:32:03.8394648Z x0 = x0.contiguous() 2025-05-07T20:32:03.8395080Z x1 = x1.contiguous() 2025-05-07T20:32:03.8395472Z 2025-05-07T20:32:03.8395787Z if scale_ub is not None: 2025-05-07T20:32:03.8396256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.8396811Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.8397329Z ) 2025-05-07T20:32:03.8397641Z else: 2025-05-07T20:32:03.8397975Z scale_ub_tensor = None 2025-05-07T20:32:03.8398385Z 2025-05-07T20:32:03.8398783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8399310Z op = silu_mul_quant 2025-05-07T20:32:03.8399953Z if compiled: 2025-05-07T20:32:03.8400522Z op = torch.compile(op) 2025-05-07T20:32:03.8401036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8401485Z 2025-05-07T20:32:03.8401805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.8402054Z 2025-05-07T20:32:03.8402209Z moe/activation_test.py:117: 2025-05-07T20:32:03.8402642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8403137Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.8403559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8404569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.8405592Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.8406389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:03.8407401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.8408391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.8409171Z kernel = self.compile( 2025-05-07T20:32:03.8409942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.8410894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.8411460Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8411797Z 2025-05-07T20:32:03.8412084Z self = 2025-05-07T20:32:03.8414133Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.8416325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f1663e0c0>} 2025-05-07T20:32:03.8418326Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.8419829Z context = 2025-05-07T20:32:03.8420260Z 2025-05-07T20:32:03.8420499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.8421273Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.8421958Z module_map=module_map) 2025-05-07T20:32:03.8422476Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.8422975Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.8423355Z E ^ 2025-05-07T20:32:03.8424235Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8424908Z 2025-05-07T20:32:03.8425521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.8426303Z 2025-05-07T20:32:03.8426477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.8427200Z self=, 2025-05-07T20:32:03.8427907Z T=128, 2025-05-07T20:32:03.8428191Z D=7168, 2025-05-07T20:32:03.8428475Z scale_ub=None, 2025-05-07T20:32:03.8428792Z contiguous=False, 2025-05-07T20:32:03.8429116Z compiled=True, 2025-05-07T20:32:03.8429429Z ) 2025-05-07T20:32:03.8429937Z self = 2025-05-07T20:32:03.8430715Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.8431394Z 2025-05-07T20:32:03.8431513Z @given( 2025-05-07T20:32:03.8431897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.8432396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.8432885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.8433465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.8433997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.8434431Z ) 2025-05-07T20:32:03.8435003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.8435813Z def test_silu_mul_quant( 2025-05-07T20:32:03.8436215Z self, 2025-05-07T20:32:03.8436547Z T: int, 2025-05-07T20:32:03.8436886Z D: int, 2025-05-07T20:32:03.8437240Z scale_ub: Optional[float], 2025-05-07T20:32:03.8437714Z contiguous: bool, 2025-05-07T20:32:03.8438120Z compiled: bool, 2025-05-07T20:32:03.8438511Z ) -> None: 2025-05-07T20:32:03.8438874Z torch.manual_seed(2025) 2025-05-07T20:32:03.8439289Z 2025-05-07T20:32:03.8439753Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.8440423Z 2025-05-07T20:32:03.8440734Z x_sign = torch.sign(x) 2025-05-07T20:32:03.8441180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.8441662Z x = x_sign * x_clamp 2025-05-07T20:32:03.8442027Z x0 = x[:, :D] 2025-05-07T20:32:03.8442385Z x1 = x[:, D:] 2025-05-07T20:32:03.8442724Z 2025-05-07T20:32:03.8443007Z if contiguous: 2025-05-07T20:32:03.8443379Z x0 = x0.contiguous() 2025-05-07T20:32:03.8443785Z x1 = x1.contiguous() 2025-05-07T20:32:03.8444183Z 2025-05-07T20:32:03.8444495Z if scale_ub is not None: 2025-05-07T20:32:03.8444937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.8445506Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.8446033Z ) 2025-05-07T20:32:03.8446349Z else: 2025-05-07T20:32:03.8446712Z scale_ub_tensor = None 2025-05-07T20:32:03.8447140Z 2025-05-07T20:32:03.8447525Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8448057Z op = silu_mul_quant 2025-05-07T20:32:03.8448478Z if compiled: 2025-05-07T20:32:03.8448881Z op = torch.compile(op) 2025-05-07T20:32:03.8449351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8449800Z 2025-05-07T20:32:03.8450130Z y_fp8, y_scale = fn() 2025-05-07T20:32:03.8450623Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:03.8451094Z 2025-05-07T20:32:03.8451469Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8451963Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:03.8452397Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:03.8452934Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:03.8453720Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8454285Z 2025-05-07T20:32:03.8454630Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:03.8454984Z 2025-05-07T20:32:03.8455160Z moe/activation_test.py:126: 2025-05-07T20:32:03.8455672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8456272Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:03.8456847Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8458326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:03.8459767Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:03.8460777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.8462063Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.8463350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:03.8464614Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.8465985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:03.8467193Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:03.8468319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:03.8469298Z fn() 2025-05-07T20:32:03.8470253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:03.8471329Z self.fn.run( 2025-05-07T20:32:03.8472136Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.8473113Z kernel = self.compile( 2025-05-07T20:32:03.8474115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.8475331Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.8476045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8476461Z 2025-05-07T20:32:03.8476836Z self = 2025-05-07T20:32:03.8478875Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.8481634Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f1663d940>} 2025-05-07T20:32:03.8484241Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.8486209Z context = 2025-05-07T20:32:03.8486794Z 2025-05-07T20:32:03.8487094Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.8488056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.8488923Z module_map=module_map) 2025-05-07T20:32:03.8489565Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.8490197Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:03.8490650Z E ^ 2025-05-07T20:32:03.8491507Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8492380Z 2025-05-07T20:32:03.8493349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.0869359Z 2025-05-07T20:32:04.0869897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.0870651Z self=, 2025-05-07T20:32:04.0871426Z T=128, 2025-05-07T20:32:04.0871729Z D=7168, 2025-05-07T20:32:04.0872034Z scale_ub=None, 2025-05-07T20:32:04.0872378Z contiguous=False, 2025-05-07T20:32:04.0872745Z compiled=False, 2025-05-07T20:32:04.0873078Z ) 2025-05-07T20:32:04.0873613Z self = 2025-05-07T20:32:04.0874454Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.0874913Z 2025-05-07T20:32:04.0875042Z @given( 2025-05-07T20:32:04.0875401Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.0876349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.0876865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.0877428Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.0877972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.0878452Z ) 2025-05-07T20:32:04.0879039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.0879803Z def test_silu_mul_quant( 2025-05-07T20:32:04.0880325Z self, 2025-05-07T20:32:04.0880633Z T: int, 2025-05-07T20:32:04.0880956Z D: int, 2025-05-07T20:32:04.0881322Z scale_ub: Optional[float], 2025-05-07T20:32:04.0881782Z contiguous: bool, 2025-05-07T20:32:04.0882167Z compiled: bool, 2025-05-07T20:32:04.0882536Z ) -> None: 2025-05-07T20:32:04.0882885Z torch.manual_seed(2025) 2025-05-07T20:32:04.0883283Z 2025-05-07T20:32:04.0883747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.0884345Z 2025-05-07T20:32:04.0884654Z x_sign = torch.sign(x) 
2025-05-07T20:32:04.0885144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.0885664Z x = x_sign * x_clamp 2025-05-07T20:32:04.0886051Z x0 = x[:, :D] 2025-05-07T20:32:04.0886401Z x1 = x[:, D:] 2025-05-07T20:32:04.0886739Z 2025-05-07T20:32:04.0887030Z if contiguous: 2025-05-07T20:32:04.0887407Z x0 = x0.contiguous() 2025-05-07T20:32:04.0887832Z x1 = x1.contiguous() 2025-05-07T20:32:04.0888221Z 2025-05-07T20:32:04.0888527Z if scale_ub is not None: 2025-05-07T20:32:04.0888980Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.0889529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.0890046Z ) 2025-05-07T20:32:04.0890350Z else: 2025-05-07T20:32:04.0890681Z scale_ub_tensor = None 2025-05-07T20:32:04.0891096Z 2025-05-07T20:32:04.0891478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.0892036Z op = silu_mul_quant 2025-05-07T20:32:04.0892442Z if compiled: 2025-05-07T20:32:04.0892851Z op = torch.compile(op) 2025-05-07T20:32:04.0893358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0893815Z 2025-05-07T20:32:04.0894134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.0894409Z 2025-05-07T20:32:04.0894564Z moe/activation_test.py:117: 2025-05-07T20:32:04.0894988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0895508Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.0895943Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0897055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.0898185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.0899315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.0900549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.0901679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.0902552Z kernel = self.compile( 2025-05-07T20:32:04.0903460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.0904583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.0905282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0905705Z 2025-05-07T20:32:04.0906061Z self = 2025-05-07T20:32:04.0908003Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.0910557Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1f143fa700>} 2025-05-07T20:32:04.0912975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.0915177Z context = 2025-05-07T20:32:04.0915702Z 2025-05-07T20:32:04.0915996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.0916968Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.0917812Z module_map=module_map) 2025-05-07T20:32:04.0918475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.0919093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.0919538Z E ^ 2025-05-07T20:32:04.0920431Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.0921252Z 2025-05-07T20:32:04.0922018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.0922968Z 2025-05-07T20:32:04.0923159Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.0923877Z self=, 2025-05-07T20:32:04.0924581Z T=4096, 2025-05-07T20:32:04.0924877Z D=5120, 2025-05-07T20:32:04.0925189Z scale_ub=1200.0, 2025-05-07T20:32:04.0925566Z contiguous=True, 2025-05-07T20:32:04.0925944Z compiled=False, 2025-05-07T20:32:04.0926292Z ) 2025-05-07T20:32:04.0926833Z self = 2025-05-07T20:32:04.0927710Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.0928203Z 2025-05-07T20:32:04.0928335Z @given( 2025-05-07T20:32:04.0928707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.0929239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.0929785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.0930372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.0930951Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.0931433Z ) 2025-05-07T20:32:04.0932024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.0932808Z def test_silu_mul_quant( 2025-05-07T20:32:04.0933220Z self, 2025-05-07T20:32:04.0933546Z T: int, 2025-05-07T20:32:04.0933880Z D: int, 2025-05-07T20:32:04.0934259Z scale_ub: Optional[float], 2025-05-07T20:32:04.0934897Z contiguous: bool, 2025-05-07T20:32:04.0935297Z compiled: bool, 2025-05-07T20:32:04.0935690Z ) -> None: 2025-05-07T20:32:04.0936062Z torch.manual_seed(2025) 2025-05-07T20:32:04.0936478Z 2025-05-07T20:32:04.0936954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.0937573Z 2025-05-07T20:32:04.0937891Z x_sign = torch.sign(x) 2025-05-07T20:32:04.0938353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.0938878Z x = x_sign * x_clamp 2025-05-07T20:32:04.0950701Z x0 = x[:, :D] 2025-05-07T20:32:04.0951112Z x1 = x[:, D:] 2025-05-07T20:32:04.0951458Z 2025-05-07T20:32:04.0951770Z if contiguous: 2025-05-07T20:32:04.0952153Z x0 = x0.contiguous() 2025-05-07T20:32:04.0952582Z x1 = x1.contiguous() 2025-05-07T20:32:04.0952994Z 2025-05-07T20:32:04.0953318Z if scale_ub is not None: 2025-05-07T20:32:04.0954019Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.0954584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.0955108Z ) 2025-05-07T20:32:04.0955440Z else: 2025-05-07T20:32:04.0955805Z scale_ub_tensor = None 2025-05-07T20:32:04.0956221Z 2025-05-07T20:32:04.0956580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.0957060Z op = silu_mul_quant 2025-05-07T20:32:04.0957423Z if compiled: 
2025-05-07T20:32:04.0957822Z op = torch.compile(op) 2025-05-07T20:32:04.0958360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0958853Z 2025-05-07T20:32:04.0959182Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.0959479Z 2025-05-07T20:32:04.0959651Z moe/activation_test.py:117: 2025-05-07T20:32:04.0960256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0960858Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.0961355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0962655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.0963969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.0964983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.0966269Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.0967519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.0968438Z kernel = self.compile( 2025-05-07T20:32:04.0969354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.0970503Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.0971251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0971679Z 2025-05-07T20:32:04.0972056Z self = 2025-05-07T20:32:04.0973766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.0976223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143f8220>} 2025-05-07T20:32:04.0978905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.0980860Z context = 2025-05-07T20:32:04.0981396Z 2025-05-07T20:32:04.0981829Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.0982795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.0983660Z module_map=module_map) 2025-05-07T20:32:04.0984312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.0984933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.0985387Z E ^ 2025-05-07T20:32:04.0986240Z E ValueError("type fp8e4nv not supported in this architecture. 
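For context, the kernel that fails to build here, _fbgemm_silu_mul_quant, fuses exactly the math that the test's ref_fn (printed above) spells out in eager mode: SiLU(x0) * x1 computed in fp32, followed by rowwise FP8 quantization. A standalone sketch of the unfused activation half (silu_mul_ref is an illustrative name, not an FBGEMM API):

    import torch

    # Unfused fp32 reference for the activation inside _fbgemm_silu_mul_quant,
    # mirroring the test's ref_fn: y = SiLU(x0) * x1.
    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32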
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.0987094Z 2025-05-07T20:32:04.0987891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.0988879Z 2025-05-07T20:32:04.0989058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.0989944Z self=, 2025-05-07T20:32:04.0990687Z T=1, 2025-05-07T20:32:04.0990994Z D=5120, 2025-05-07T20:32:04.0991326Z scale_ub=None, 2025-05-07T20:32:04.0991699Z contiguous=True, 2025-05-07T20:32:04.0992077Z compiled=True, 2025-05-07T20:32:04.0992431Z ) 2025-05-07T20:32:04.0992998Z self = 2025-05-07T20:32:04.0993871Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.0994360Z 2025-05-07T20:32:04.0994488Z @given( 2025-05-07T20:32:04.0994879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.0995437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.0995971Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.0996567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.0997155Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.0997669Z ) 2025-05-07T20:32:04.0998311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.0999134Z def test_silu_mul_quant( 2025-05-07T20:32:04.0999555Z self, 2025-05-07T20:32:04.0999887Z T: int, 2025-05-07T20:32:04.1000346Z D: int, 2025-05-07T20:32:04.1000727Z scale_ub: Optional[float], 2025-05-07T20:32:04.1001193Z contiguous: bool, 2025-05-07T20:32:04.1001621Z compiled: bool, 2025-05-07T20:32:04.1002008Z ) -> None: 2025-05-07T20:32:04.1002368Z torch.manual_seed(2025) 2025-05-07T20:32:04.1002786Z 2025-05-07T20:32:04.1003189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.1003650Z 2025-05-07T20:32:04.1003919Z x_sign = torch.sign(x) 2025-05-07T20:32:04.1004326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.1004779Z x = x_sign * x_clamp 2025-05-07T20:32:04.1005147Z x0 = x[:, :D] 2025-05-07T20:32:04.1005492Z x1 = x[:, D:] 2025-05-07T20:32:04.1005790Z 2025-05-07T20:32:04.1006066Z if contiguous: 2025-05-07T20:32:04.1006403Z x0 = x0.contiguous() 2025-05-07T20:32:04.1006861Z x1 = x1.contiguous() 2025-05-07T20:32:04.1007233Z 2025-05-07T20:32:04.1007522Z if scale_ub is not None: 2025-05-07T20:32:04.1007955Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.1008515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.1009037Z ) 2025-05-07T20:32:04.1009337Z else: 2025-05-07T20:32:04.1009652Z scale_ub_tensor = None 2025-05-07T20:32:04.1010058Z 2025-05-07T20:32:04.1010434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.1010929Z op = silu_mul_quant 2025-05-07T20:32:04.1011340Z if compiled: 2025-05-07T20:32:04.1011749Z op = torch.compile(op) 2025-05-07T20:32:04.1012229Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.1012691Z 2025-05-07T20:32:04.1013154Z y_fp8, y_scale = fn() 2025-05-07T20:32:04.1013927Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:04.1014436Z 2025-05-07T20:32:04.1014855Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.1015454Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:04.1015982Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:04.1016549Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:04.1017196Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.1017740Z 2025-05-07T20:32:04.1018086Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:04.1018445Z 2025-05-07T20:32:04.1018625Z moe/activation_test.py:126: 2025-05-07T20:32:04.1019147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.1019756Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:04.1020565Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.1022086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:04.1023545Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:04.1024573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.1025868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.1027216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:04.1028597Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.1030000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:04.1031236Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:04.1032377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:04.1033361Z fn() 2025-05-07T20:32:04.1034319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:04.1035423Z self.fn.run( 2025-05-07T20:32:04.1036297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.1037297Z kernel = self.compile( 2025-05-07T20:32:04.1038313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.1039550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.1040376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.1040807Z 2025-05-07T20:32:04.1041191Z self = 2025-05-07T20:32:04.1043281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.1045972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143fae80>} 2025-05-07T20:32:04.1048597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.1050572Z context = 2025-05-07T20:32:04.1051112Z 2025-05-07T20:32:04.1051415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.1053267Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.1054153Z module_map=module_map) 2025-05-07T20:32:04.1054801Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.1055440Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:04.1055906Z E ^ 2025-05-07T20:32:04.1056766Z E ValueError("type fp8e4nv not supported in this architecture. 
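The reference path fails the same way because triton_quantize_fp8_row launches its own Triton kernel, _kernel_quantize_fp8_row, which also materializes fp8e4nv values, so both the fused op and its reference hit the identical architecture limit. The underlying pattern is plain rowwise scaling; a hedged pure-PyTorch sketch (the function name and the exact scale_ub clamping semantics are assumptions based on the test's usage, not fbgemm's implementation):

    from typing import Optional, Tuple

    import torch

    # Rowwise FP8 quantization sketch: scale each row so its max magnitude
    # maps to the fp8 e4m3 max, optionally clamping the row max by scale_ub.
    # Returns the fp8 tensor plus a per-row dequantization scale, so that
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None], as the test checks.
    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_fp8 = (y * (fp8_max / row_max)).to(torch.float8_e4m3fn)
        return y_fp8, (row_max / fp8_max).squeeze(-1)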
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.1057637Z 2025-05-07T20:32:04.1058449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.3432038Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.3434148Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:04.3436993Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.3439720Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.3441670Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.3444196Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.3446930Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.3449250Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.3451780Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.3453709Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:04.3455917Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.3458048Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:04.3459538Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:04.3461653Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.3463948Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:04.3466079Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:04.3468031Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:04.3470276Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.3472628Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.3474263Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:04.3476316Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:04.3478490Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:04.3479927Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:04.3482159Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.3484642Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.3486604Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.3488340Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.3489742Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:04.3491681Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2168809Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.2169969Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.2171395Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.2172915Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.2173945Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.2175479Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.2176941Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2178324Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.2179781Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2180901Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.2182238Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.2183557Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.2184450Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2185722Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.2187001Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.2188102Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.2189183Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.2190482Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.2191838Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.2192788Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2194026Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.2195130Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.2195956Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.2197207Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.2198632Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.2199842Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2200910Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2201701Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.2202781Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2783234Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.2784801Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.2786317Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.2787818Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.2788851Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.2790220Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.2791674Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2793041Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.2794483Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2795577Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.2797283Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.2798599Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.2799490Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2800845Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.2802111Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.2803198Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.2804404Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.2805684Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.2807031Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.2807982Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2809128Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.2810227Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.2811041Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.2812272Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.2813928Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.2815042Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2816011Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2816798Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.2817867Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4650196Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4651443Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.4653250Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4654770Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4655799Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4657209Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4658651Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4660390Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4661832Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4662928Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.4664245Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4665561Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.4666450Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4667763Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4669026Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.4670107Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.4671176Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.4672463Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4673802Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4674748Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4675903Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.4676994Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.4678448Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.4679688Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4681190Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4682301Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4683251Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4684121Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.4685195Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4744855Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4754604Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.4756061Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4757562Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4758591Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4759960Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4761478Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4762848Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4764284Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4765387Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.4766715Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4768019Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.4769165Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4770433Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4771701Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.4772787Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.4773858Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.4775140Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4776605Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4777555Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4778697Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.4779789Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.4780593Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.4781836Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4783250Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4784362Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4785318Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4786097Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.4787227Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
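For context on the repeated warning: identify_mutated_tensors (torch/_higher_order_ops/triton_kernel_wrap.py) lowers a user-defined Triton kernel to TTIR so torch.compile can work out which arguments the kernel writes; when that lowering itself raises, as it does here, it conservatively assumes every input is mutated and compilation continues. The underlying failure is independent of torch.compile; the sketch below is a hypothetical minimal repro (kernel name, sizes, and launch grid are illustrative, not taken from this job) that should trip the same backend check on any CUDA GPU with compute capability below 8.9, such as this runner's A10G (sm_86):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 the NVIDIA backend rejects this cast at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(4,)](x, y, 1024, BLOCK=256)

On sm_89 and newer (Ada/Hopper) the same cast compiles, which is why this failure is specific to the runner's GPU class rather than to the FBGEMM kernels themselves.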
2025-05-07T20:32:05.6628332Z 
2025-05-07T20:32:05.6628630Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.6629118Z     self=,
2025-05-07T20:32:05.6629647Z     T=2048,
2025-05-07T20:32:05.6629931Z     D=5120,
2025-05-07T20:32:05.6630135Z     scale_ub=None,
2025-05-07T20:32:05.6630371Z     contiguous=True,
2025-05-07T20:32:05.6630608Z     compiled=True,
2025-05-07T20:32:05.6630820Z )
2025-05-07T20:32:05.6631159Z self = 
2025-05-07T20:32:05.6631678Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:05.6631958Z 
2025-05-07T20:32:05.6632050Z     @given(
2025-05-07T20:32:05.6632289Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.6632642Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.6633302Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.6633649Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.6633998Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.6634299Z     )
2025-05-07T20:32:05.6634663Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.6635127Z     def test_silu_mul_quant(
2025-05-07T20:32:05.6635385Z         self,
2025-05-07T20:32:05.6635589Z         T: int,
2025-05-07T20:32:05.6635796Z         D: int,
2025-05-07T20:32:05.6636030Z         scale_ub: Optional[float],
2025-05-07T20:32:05.6636311Z         contiguous: bool,
2025-05-07T20:32:05.6636568Z         compiled: bool,
2025-05-07T20:32:05.6636814Z     ) -> None:
2025-05-07T20:32:05.6637039Z         torch.manual_seed(2025)
2025-05-07T20:32:05.6637299Z 
2025-05-07T20:32:05.6637589Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.6638098Z 
2025-05-07T20:32:05.6638304Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.6638615Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.6638946Z         x = x_sign * x_clamp
2025-05-07T20:32:05.6639200Z         x0 = x[:, :D]
2025-05-07T20:32:05.6639431Z         x1 = x[:, D:]
2025-05-07T20:32:05.6639656Z 
2025-05-07T20:32:05.6639848Z         if contiguous:
2025-05-07T20:32:05.6640095Z             x0 = x0.contiguous()
2025-05-07T20:32:05.6640453Z             x1 = x1.contiguous()
2025-05-07T20:32:05.6640700Z 
2025-05-07T20:32:05.6640909Z         if scale_ub is not None:
2025-05-07T20:32:05.6641206Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.6641559Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.6641889Z             )
2025-05-07T20:32:05.6642095Z         else:
2025-05-07T20:32:05.6642319Z             scale_ub_tensor = None
2025-05-07T20:32:05.6642590Z 
2025-05-07T20:32:05.6642846Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.6643188Z             op = silu_mul_quant
2025-05-07T20:32:05.6643449Z             if compiled:
2025-05-07T20:32:05.6643714Z                 op = torch.compile(op)
2025-05-07T20:32:05.6644027Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.6644317Z 
2025-05-07T20:32:05.6644523Z         y_fp8, y_scale = fn()
2025-05-07T20:32:05.6644825Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:05.6645127Z 
2025-05-07T20:32:05.6645380Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.6645735Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:05.6646039Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:05.6646372Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:05.6646755Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.6647134Z 
2025-05-07T20:32:05.6647347Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:05.6647559Z 
2025-05-07T20:32:05.6647671Z moe/activation_test.py:126: 
2025-05-07T20:32:05.6647988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.6648343Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:05.6648693Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.6649523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:05.6650304Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:05.6650889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:05.6651609Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.6652334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:05.6653189Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.6653960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:05.6654634Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:05.6655270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:05.6655808Z     fn()
2025-05-07T20:32:05.6656343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:05.6656959Z     self.fn.run(
2025-05-07T20:32:05.6657443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.6658002Z     kernel = self.compile(
2025-05-07T20:32:05.6658573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.6659350Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.6659764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.6660013Z 
2025-05-07T20:32:05.6660233Z self = 
2025-05-07T20:32:05.6661364Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.6662816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f0b80>}
2025-05-07T20:32:05.6664208Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.6665283Z context = 
2025-05-07T20:32:05.6665593Z 
2025-05-07T20:32:05.6665769Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.6666321Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.6666810Z                            module_map=module_map)
2025-05-07T20:32:05.6667199Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.6667579Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:05.6667863Z E   ^
2025-05-07T20:32:05.6668344Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.6668819Z 
2025-05-07T20:32:05.6669254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.6669792Z 
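The dtype at fault is Triton's fp8e4nv, the encoding behind torch.float8_e4m3fn. Triton's NVIDIA backend only implements it for compute capability 8.9 and newer (Ada/Hopper), while the A10G on this g5.4xlarge runner reports 8.6 and therefore only gets fp8e4b15 and fp8e5, exactly as the ValueError states; both the torch.compile path and the eager triton_quantize_fp8_row reference die on the same check. The usual remedy is to gate such tests on hardware support; a minimal sketch (hypothetical helper, not part of moe/activation_test.py):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs an sm_89+ GPU; on sm_86 parts
        # like the A10G, Triton only exposes fp8e4b15 and fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(...): ...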
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6668819Z 2025-05-07T20:32:05.6669254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.6669792Z 2025-05-07T20:32:05.6669916Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.6670347Z self=, 2025-05-07T20:32:05.6670772Z T=128, 2025-05-07T20:32:05.6670971Z D=5120, 2025-05-07T20:32:05.6671176Z scale_ub=None, 2025-05-07T20:32:05.6671395Z contiguous=True, 2025-05-07T20:32:05.6671631Z compiled=True, 2025-05-07T20:32:05.6671846Z ) 2025-05-07T20:32:05.6672177Z self = 2025-05-07T20:32:05.6672692Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.6672968Z 2025-05-07T20:32:05.6673055Z @given( 2025-05-07T20:32:05.6673292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.6673626Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.6673962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.6674311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.6674745Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.6675057Z ) 2025-05-07T20:32:05.6675426Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.6675887Z def test_silu_mul_quant( 2025-05-07T20:32:05.6676146Z self, 2025-05-07T20:32:05.6676354Z T: int, 2025-05-07T20:32:05.6676558Z D: int, 2025-05-07T20:32:05.6676791Z scale_ub: Optional[float], 2025-05-07T20:32:05.6677080Z contiguous: bool, 2025-05-07T20:32:05.6677329Z compiled: bool, 2025-05-07T20:32:05.6677567Z ) -> None: 2025-05-07T20:32:05.6677796Z torch.manual_seed(2025) 2025-05-07T20:32:05.6678048Z 2025-05-07T20:32:05.6678343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.6678703Z 2025-05-07T20:32:05.6678906Z x_sign = torch.sign(x) 2025-05-07T20:32:05.6679211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.6679647Z x = x_sign * x_clamp 2025-05-07T20:32:05.6679908Z x0 = x[:, :D] 2025-05-07T20:32:05.6680259Z x1 = x[:, D:] 2025-05-07T20:32:05.6680484Z 2025-05-07T20:32:05.6680676Z if contiguous: 2025-05-07T20:32:05.6680921Z x0 = x0.contiguous() 2025-05-07T20:32:05.6681192Z x1 = x1.contiguous() 2025-05-07T20:32:05.6681439Z 2025-05-07T20:32:05.6681644Z if scale_ub is not None: 2025-05-07T20:32:05.6681930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.6682278Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.6682608Z ) 2025-05-07T20:32:05.6682818Z else: 2025-05-07T20:32:05.6683040Z scale_ub_tensor = None 2025-05-07T20:32:05.6683304Z 2025-05-07T20:32:05.6683548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6683874Z op = silu_mul_quant 2025-05-07T20:32:05.6684147Z if compiled: 2025-05-07T20:32:05.6684411Z op = torch.compile(op) 2025-05-07T20:32:05.6684722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.6685017Z 2025-05-07T20:32:05.6685219Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.6685512Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.6685822Z 2025-05-07T20:32:05.6686074Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6686428Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.6686735Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.6687112Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.6687491Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6687812Z 2025-05-07T20:32:05.6688024Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:05.6688228Z 2025-05-07T20:32:05.6688338Z moe/activation_test.py:126: 2025-05-07T20:32:05.6688646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6689007Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.6689352Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6690176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.6690954Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.6691529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.6692245Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.6692974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.6693723Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.6694581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.6695259Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.6695886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.6696432Z fn() 2025-05-07T20:32:05.6696967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.6697574Z self.fn.run( 2025-05-07T20:32:05.6698062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.6698621Z kernel = self.compile( 2025-05-07T20:32:05.6699192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.6699875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.6700384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6700633Z 2025-05-07T20:32:05.6700850Z self = 2025-05-07T20:32:05.6701980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.6703412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f1da0>} 2025-05-07T20:32:05.6704806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.6705877Z context = 2025-05-07T20:32:05.6706186Z 2025-05-07T20:32:05.6706375Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.6706960Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.6707470Z module_map=module_map) 2025-05-07T20:32:05.6707856Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.6708235Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.6708512Z E ^ 2025-05-07T20:32:05.6709000Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6709469Z 2025-05-07T20:32:05.6709910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8984291Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.8985466Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.8986875Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.8988385Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.8989419Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.8991104Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.8992567Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8993942Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.8995389Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8996493Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:05.8997957Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.8999258Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.9000224Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9001492Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.9002754Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.9003845Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.9004923Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.9006207Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.9007556Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.9008514Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9009662Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.9010755Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.9011578Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.9012812Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.9014573Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.9015816Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9016781Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.9017567Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.9018643Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.9597343Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.9598471Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.9600299Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.9601809Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.9602841Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.9604210Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.9605660Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.9607033Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.9608472Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.9609572Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:05.9610897Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.9612196Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.9613088Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9614589Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.9615858Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.9616944Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.9618187Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return 
visitor(node) 2025-05-07T20:32:05.9619476Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.9620815Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.9621770Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9622906Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.9624121Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.9624942Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.9626176Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.9627593Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.9628696Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9629656Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.9630445Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.9631516Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.1473029Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.1474156Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:06.1475590Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.1477113Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.1478153Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.1479536Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.1481060Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.1482765Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.1484224Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.1485327Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:06.1486657Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.1487962Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:06.1488989Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:06.1490254Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.1491527Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:06.1492617Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:06.1493684Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return 
visitor(node) 2025-05-07T20:32:06.1494973Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.1496312Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.1497263Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:06.1498405Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:06.1499490Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:06.1500315Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:06.1501545Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.1502962Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.1504076Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.1505032Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.1505818Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:06.1506979Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.0904322Z
2025-05-07T20:32:07.0904652Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.0905272Z self=,
2025-05-07T20:32:07.0905862Z T=4096,
2025-05-07T20:32:07.0906127Z D=5120,
2025-05-07T20:32:07.0906393Z scale_ub=None,
2025-05-07T20:32:07.0906680Z contiguous=True,
2025-05-07T20:32:07.0906950Z compiled=True,
2025-05-07T20:32:07.0907187Z )
2025-05-07T20:32:07.0907580Z self =
2025-05-07T20:32:07.0908100Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:07.0908408Z
2025-05-07T20:32:07.0908493Z @given(
2025-05-07T20:32:07.0908758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:07.0909094Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:07.0909427Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:07.0909789Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:07.0910137Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:07.0910448Z )
2025-05-07T20:32:07.0910828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:07.0911298Z def test_silu_mul_quant(
2025-05-07T20:32:07.0911557Z self,
2025-05-07T20:32:07.0911775Z T: int,
2025-05-07T20:32:07.0911993Z D: int,
2025-05-07T20:32:07.0912223Z scale_ub: Optional[float],
2025-05-07T20:32:07.0912524Z contiguous: bool,
2025-05-07T20:32:07.0912788Z compiled: bool,
2025-05-07T20:32:07.0913028Z ) -> None:
2025-05-07T20:32:07.0913274Z torch.manual_seed(2025)
2025-05-07T20:32:07.0914825Z
2025-05-07T20:32:07.0915190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:07.0915571Z
2025-05-07T20:32:07.0915785Z x_sign = torch.sign(x)
2025-05-07T20:32:07.0916093Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:07.0916431Z x = x_sign * x_clamp
2025-05-07T20:32:07.0916698Z x0 = x[:, :D]
2025-05-07T20:32:07.0916919Z x1 = x[:, D:]
2025-05-07T20:32:07.0917137Z
2025-05-07T20:32:07.0917332Z if contiguous:
2025-05-07T20:32:07.0917567Z x0 = x0.contiguous()
2025-05-07T20:32:07.0917841Z x1 = x1.contiguous()
2025-05-07T20:32:07.0918092Z
2025-05-07T20:32:07.0918295Z if scale_ub is not None:
2025-05-07T20:32:07.0918588Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:07.0918942Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:07.0919265Z )
2025-05-07T20:32:07.0919482Z else:
2025-05-07T20:32:07.0920051Z scale_ub_tensor = None
2025-05-07T20:32:07.0920440Z
2025-05-07T20:32:07.0920692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.0921016Z op = silu_mul_quant
2025-05-07T20:32:07.0921282Z if compiled:
2025-05-07T20:32:07.0921543Z op = torch.compile(op)
2025-05-07T20:32:07.0921855Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.0922136Z
2025-05-07T20:32:07.0922342Z y_fp8, y_scale = fn()
2025-05-07T20:32:07.0922640Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:07.0922936Z
2025-05-07T20:32:07.0923184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.0923532Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:07.0923831Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:07.0924155Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:07.0924754Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.0925079Z
2025-05-07T20:32:07.0925285Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.0925494Z
2025-05-07T20:32:07.0925602Z moe/activation_test.py:126:
2025-05-07T20:32:07.0925908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0926257Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.0926598Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.0927420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.0928198Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.0928768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.0929483Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.0930215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.0930963Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.0931724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.0932394Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.0933022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.0933554Z fn()
2025-05-07T20:32:07.0934081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.0934682Z self.fn.run(
2025-05-07T20:32:07.0935162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.0935721Z kernel = self.compile(
2025-05-07T20:32:07.0936286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.0936965Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.0937396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0937661Z
2025-05-07T20:32:07.0937874Z self =
2025-05-07T20:32:07.0938995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.0940437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef32bf1a0>}
2025-05-07T20:32:07.0941915Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.0942982Z context =
2025-05-07T20:32:07.0943291Z
2025-05-07T20:32:07.0943465Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.0944013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.0944493Z module_map=module_map)
2025-05-07T20:32:07.0944875Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.0945249Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.0945518Z E ^
2025-05-07T20:32:07.0946001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.0946554Z
2025-05-07T20:32:07.0946994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.0947526Z
2025-05-07T20:32:07.0947643Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.0948073Z self=,
2025-05-07T20:32:07.0948492Z T=16384,
2025-05-07T20:32:07.0948695Z D=5120,
2025-05-07T20:32:07.0948890Z scale_ub=None,
2025-05-07T20:32:07.0949112Z contiguous=True,
2025-05-07T20:32:07.0949343Z compiled=True,
2025-05-07T20:32:07.0949561Z )
2025-05-07T20:32:07.0949891Z self =
2025-05-07T20:32:07.0950409Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:07.0965240Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.0965530Z
2025-05-07T20:32:07.0965640Z moe/activation_test.py:126:
2025-05-07T20:32:07.0965947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0966289Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.0966628Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.0967477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.0968279Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.0968840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.0969547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.0970259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.0971023Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.0971774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.0972439Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.0973063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.0973592Z fn()
2025-05-07T20:32:07.0974113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.0974713Z self.fn.run(
2025-05-07T20:32:07.0975198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.0975740Z kernel = self.compile(
2025-05-07T20:32:07.0976298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.0976986Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.0977394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0977640Z
2025-05-07T20:32:07.0977856Z self =
2025-05-07T20:32:07.0978969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.0980387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef3a90860>}
2025-05-07T20:32:07.0981782Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.0982919Z context =
2025-05-07T20:32:07.0983226Z
2025-05-07T20:32:07.0983397Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.0983938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.0984420Z module_map=module_map)
2025-05-07T20:32:07.0984792Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.0985162Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.0985436Z E ^
2025-05-07T20:32:07.0985912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.0986393Z
2025-05-07T20:32:07.0986824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.1201445Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:07.1202981Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:07.1204368Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:07.1205393Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:07.1206545Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
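The recompile-limit warning above explains why the compiled path stops being exercised from here on: after eight recompilations of silu_mul_quant (each triggered by a stride change between contiguous and sliced inputs), torch._dynamo falls back to eager execution for subsequent calls. A minimal sketch of how a caller could respond, assuming only the config attribute and APIs named in the warning; the import path for silu_mul_quant is inferred from the traceback and is an assumption:

    import torch
    import torch._dynamo
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Raise the per-function limit that the warning reports (default 8), so
    # each stride variant of the inputs can keep its own compiled graph...
    torch._dynamo.config.recompile_limit = 32

    # ...or compile with dynamic shapes up front, so a single graph is less
    # likely to be re-specialized for every size/stride combination.
    op = torch.compile(silu_mul_quant, dynamic=True)

Running with TORCH_LOGS="recompiles" in the environment, as the warning suggests, prints the exact guard that failed before each recompilation.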
2025-05-07T20:32:07.3332164Z
2025-05-07T20:32:07.3332427Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3333046Z self=,
2025-05-07T20:32:07.3333512Z T=1,
2025-05-07T20:32:07.3333708Z D=5120,
2025-05-07T20:32:07.3333960Z scale_ub=1200.0,
2025-05-07T20:32:07.3334198Z contiguous=True,
2025-05-07T20:32:07.3334436Z compiled=True,
2025-05-07T20:32:07.3334655Z )
2025-05-07T20:32:07.3334995Z self =
2025-05-07T20:32:07.3335512Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:07.3356484Z > y_fp8, y_scale = fn()
2025-05-07T20:32:07.3356664Z
2025-05-07T20:32:07.3356782Z moe/activation_test.py:117:
2025-05-07T20:32:07.3357096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3357456Z moe/activation_test.py:115: in fn
2025-05-07T20:32:07.3357761Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3358355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:07.3358948Z return fn(*args, **kwargs)
2025-05-07T20:32:07.3359648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:07.3360468Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:07.3361047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3361770Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3362475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3363041Z kernel = self.compile(
2025-05-07T20:32:07.3363616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3364316Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3364751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3364995Z
2025-05-07T20:32:07.3365216Z self =
2025-05-07T20:32:07.3366358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3367808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2d12ca0>}
2025-05-07T20:32:07.3369212Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3370289Z context =
2025-05-07T20:32:07.3370596Z
2025-05-07T20:32:07.3370775Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3371339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3371837Z module_map=module_map)
2025-05-07T20:32:07.3372400Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3372776Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:07.3373053Z E ^
2025-05-07T20:32:07.3373550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3374021Z
2025-05-07T20:32:07.3374458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.3375000Z
2025-05-07T20:32:07.3375112Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3375554Z self=,
2025-05-07T20:32:07.3375978Z T=1,
2025-05-07T20:32:07.3376170Z D=5120,
2025-05-07T20:32:07.3376383Z scale_ub=None,
2025-05-07T20:32:07.3376612Z contiguous=False,
2025-05-07T20:32:07.3376849Z compiled=True,
2025-05-07T20:32:07.3377147Z )
2025-05-07T20:32:07.3377493Z self =
2025-05-07T20:32:07.3378050Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:07.3393291Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.3393500Z
2025-05-07T20:32:07.3393607Z moe/activation_test.py:126:
2025-05-07T20:32:07.3393930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3394292Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.3394644Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.3395466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.3396252Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.3396920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3397646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3398367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.3399136Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.3399919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.3400673Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.3401317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.3401869Z fn()
2025-05-07T20:32:07.3402410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.3403025Z self.fn.run(
2025-05-07T20:32:07.3403530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3404095Z kernel = self.compile(
2025-05-07T20:32:07.3404664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3405357Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3405791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3406037Z
2025-05-07T20:32:07.3406270Z self =
2025-05-07T20:32:07.3407399Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3408893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef38c60c0>}
2025-05-07T20:32:07.3410289Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3411355Z context =
2025-05-07T20:32:07.3411657Z
2025-05-07T20:32:07.3411845Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3412394Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3412898Z module_map=module_map)
2025-05-07T20:32:07.3413289Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3413932Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.3414230Z E ^
2025-05-07T20:32:07.3414864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3415346Z
2025-05-07T20:32:07.3415793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.4794648Z
2025-05-07T20:32:07.4794914Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.4795403Z self=,
2025-05-07T20:32:07.4795868Z T=1,
2025-05-07T20:32:07.4796071Z D=5120,
2025-05-07T20:32:07.4796280Z scale_ub=None,
2025-05-07T20:32:07.4796500Z contiguous=True,
2025-05-07T20:32:07.4796739Z compiled=False,
2025-05-07T20:32:07.4796959Z )
2025-05-07T20:32:07.4797291Z self =
2025-05-07T20:32:07.4797803Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:07.4810085Z > y_fp8, y_scale = fn()
2025-05-07T20:32:07.4810255Z
2025-05-07T20:32:07.4810363Z moe/activation_test.py:117:
2025-05-07T20:32:07.4810665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.4811011Z moe/activation_test.py:115: in fn
2025-05-07T20:32:07.4811306Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.4812148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:07.4812877Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:07.4813586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.4814303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.4814989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.4815542Z kernel = self.compile(
2025-05-07T20:32:07.4816105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.4816794Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.4817202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.4817445Z
2025-05-07T20:32:07.4817834Z self =
2025-05-07T20:32:07.4818993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.4820425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164f1bc0>}
2025-05-07T20:32:07.4821818Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.4822880Z context =
2025-05-07T20:32:07.4823187Z
2025-05-07T20:32:07.4823361Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.4823920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.4824402Z module_map=module_map)
2025-05-07T20:32:07.4824785Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.4825155Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:07.4825422Z E ^
2025-05-07T20:32:07.4825908Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis then tried further example draws. Each repeated the identical source listing and traceback above, failing at moe/activation_test.py:117 in _fbgemm_silu_mul_quant with the same CompilationError; draws with compiled=True additionally pass through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80. The duplicated listings are elided:]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

In this draw the test body above ran past `y_fp8, y_scale = fn()`, and the failure surfaced in the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
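Every failure above shares one root cause: Triton's fp8e4nv is FP8 E4M3 with NVIDIA semantics, and as of recent Triton releases it is only lowered on GPUs with native FP8 support (compute capability 8.9 or newer, i.e. Ada/Hopper class). The error text shows that the GPU used for this job offers only the 'fp8e4b15' and 'fp8e5' formats, so every kernel touching an E4M3 tensor aborts inside ast_to_ttir, before any launch. A minimal guard of the kind a test suite could use to skip rather than fail on such hardware is sketched below; the helper name and its placement are illustrative, not part of activation_test.py:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs compute capability >= (8, 9); older GPUs hit
        # the "type fp8e4nv not supported in this architecture" CompilationError.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class or method, this skips cleanly on older GPUs:
    # @unittest.skipUnless(supports_fp8e4nv(), "requires FP8 E4M3 capable hardware")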
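For context on what the failing kernels compute: silu_mul_quant fuses y = silu(x0) * x1 with row-wise FP8 quantization, and the test's reference path does the same in two steps, as its dequantization line `y_fp8.to(torch.float32) * y_scale[:, None]` shows. A rough pure-PyTorch sketch of the row-wise quantization idea follows, with illustrative names and the standard E4M3 maximum; this is a sketch of the concept, not FBGEMM's implementation, and the exact scale_ub handling is assumed:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # One dequantization scale per row, derived from the row's max magnitude,
        # optionally clamped by scale_ub (cf. the test's scale_ub_tensor).
        row_max = y.abs().amax(dim=-1).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale  # y_fp8.to(torch.float32) * y_scale[:, None] ~ y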
[The remaining draws in this batch repeated the original source listing and traceback once more, each failing in _fbgemm_silu_mul_quant with the identical CompilationError; the duplicated listings are elided:]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5731399Z 2025-05-07T20:32:08.5731848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5732407Z 2025-05-07T20:32:08.5732520Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5732970Z self=, 2025-05-07T20:32:08.5733398Z T=1, 2025-05-07T20:32:08.5733604Z D=5120, 2025-05-07T20:32:08.5733818Z scale_ub=None, 2025-05-07T20:32:08.5734051Z contiguous=False, 2025-05-07T20:32:08.5734304Z compiled=False, 2025-05-07T20:32:08.5734538Z ) 2025-05-07T20:32:08.5734877Z self = 2025-05-07T20:32:08.5735408Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:08.5735699Z 2025-05-07T20:32:08.5735785Z @given( 2025-05-07T20:32:08.5736043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5736378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5736724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5737214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5737574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5737917Z ) 2025-05-07T20:32:08.5738324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5738795Z def test_silu_mul_quant( 2025-05-07T20:32:08.5739069Z self, 2025-05-07T20:32:08.5739289Z T: int, 2025-05-07T20:32:08.5739502Z D: int, 2025-05-07T20:32:08.5739748Z scale_ub: Optional[float], 2025-05-07T20:32:08.5740051Z contiguous: bool, 2025-05-07T20:32:08.5740310Z compiled: bool, 2025-05-07T20:32:08.5740565Z ) -> None: 2025-05-07T20:32:08.5740806Z torch.manual_seed(2025) 2025-05-07T20:32:08.5741076Z 2025-05-07T20:32:08.5741370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5741744Z 2025-05-07T20:32:08.5742049Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5742371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5742715Z x = x_sign * x_clamp 2025-05-07T20:32:08.5742987Z x0 = x[:, :D] 2025-05-07T20:32:08.5743227Z x1 = x[:, D:] 2025-05-07T20:32:08.5743459Z 2025-05-07T20:32:08.5743669Z if contiguous: 2025-05-07T20:32:08.5743917Z x0 = x0.contiguous() 2025-05-07T20:32:08.5744209Z x1 = x1.contiguous() 2025-05-07T20:32:08.5744480Z 2025-05-07T20:32:08.5744687Z if scale_ub is not None: 2025-05-07T20:32:08.5744990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5745362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5745695Z ) 2025-05-07T20:32:08.5745917Z else: 2025-05-07T20:32:08.5746153Z scale_ub_tensor = None 2025-05-07T20:32:08.5746422Z 2025-05-07T20:32:08.5746683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5747038Z op = silu_mul_quant 2025-05-07T20:32:08.5747321Z if compiled: 2025-05-07T20:32:08.5747589Z op = torch.compile(op) 2025-05-07T20:32:08.5747915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5748216Z 2025-05-07T20:32:08.5748422Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5748607Z 2025-05-07T20:32:08.5748713Z moe/activation_test.py:117: 2025-05-07T20:32:08.5749036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5749389Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5749699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5750434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.5751165Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5751739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5752481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5753195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5753757Z kernel = self.compile( 2025-05-07T20:32:08.5754338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5755040Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5755473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5755717Z 2025-05-07T20:32:08.5755938Z self = 2025-05-07T20:32:08.5757076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5758637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2d125c0>} 2025-05-07T20:32:08.5760052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5761193Z context = 2025-05-07T20:32:08.5761499Z 2025-05-07T20:32:08.5761679Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5762240Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5762741Z module_map=module_map) 2025-05-07T20:32:08.5763130Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5763597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5763889Z E ^ 2025-05-07T20:32:08.5764390Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5764862Z 2025-05-07T20:32:08.5765299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5765848Z 2025-05-07T20:32:08.5765962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5766441Z self=, 2025-05-07T20:32:08.5766872Z T=4096, 2025-05-07T20:32:08.5767071Z D=7168, 2025-05-07T20:32:08.5767287Z scale_ub=1200.0, 2025-05-07T20:32:08.5767533Z contiguous=False, 2025-05-07T20:32:08.5767775Z compiled=False, 2025-05-07T20:32:08.5768005Z ) 2025-05-07T20:32:08.5768358Z self = 2025-05-07T20:32:08.5768910Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.5769202Z 2025-05-07T20:32:08.5769287Z @given( 2025-05-07T20:32:08.5769543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5769885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5770213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5770576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5770937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5771240Z ) 2025-05-07T20:32:08.5771619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5772095Z def test_silu_mul_quant( 2025-05-07T20:32:08.5772363Z self, 2025-05-07T20:32:08.5772571Z T: int, 2025-05-07T20:32:08.5772791Z D: int, 2025-05-07T20:32:08.5773033Z scale_ub: Optional[float], 2025-05-07T20:32:08.5773318Z contiguous: bool, 2025-05-07T20:32:08.5773586Z compiled: bool, 2025-05-07T20:32:08.5773829Z ) -> None: 2025-05-07T20:32:08.5774060Z torch.manual_seed(2025) 2025-05-07T20:32:08.5774327Z 2025-05-07T20:32:08.5774625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5774985Z 2025-05-07T20:32:08.5775205Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5775521Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5775850Z x = x_sign * x_clamp 2025-05-07T20:32:08.5776114Z x0 = x[:, :D] 2025-05-07T20:32:08.5776353Z x1 = x[:, D:] 2025-05-07T20:32:08.5776571Z 2025-05-07T20:32:08.5776779Z if contiguous: 2025-05-07T20:32:08.5777035Z x0 = x0.contiguous() 2025-05-07T20:32:08.5777308Z x1 = x1.contiguous() 2025-05-07T20:32:08.5777576Z 2025-05-07T20:32:08.5777814Z if scale_ub is not None: 2025-05-07T20:32:08.5778131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5778494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5778911Z ) 2025-05-07T20:32:08.5779125Z else: 2025-05-07T20:32:08.5779352Z scale_ub_tensor = None 2025-05-07T20:32:08.5779628Z 2025-05-07T20:32:08.5779883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5780216Z op = silu_mul_quant 2025-05-07T20:32:08.5780489Z if compiled: 2025-05-07T20:32:08.5780760Z op = torch.compile(op) 2025-05-07T20:32:08.5781074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5781377Z 2025-05-07T20:32:08.5781595Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5781770Z 2025-05-07T20:32:08.5781878Z moe/activation_test.py:117: 2025-05-07T20:32:08.5782199Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5782557Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5782864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5783666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.5784394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5784955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5785673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5786378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5786943Z kernel = self.compile( 2025-05-07T20:32:08.5787504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5788196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5788618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5788863Z 2025-05-07T20:32:08.5789092Z self = 2025-05-07T20:32:08.5790209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5791636Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f60900>} 2025-05-07T20:32:08.5793039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5794110Z context = 2025-05-07T20:32:08.5794413Z 2025-05-07T20:32:08.5794595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5795150Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5795645Z module_map=module_map) 2025-05-07T20:32:08.5796035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5796407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5796686Z E ^ 2025-05-07T20:32:08.5797179Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5797647Z 2025-05-07T20:32:08.5798130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.7331801Z 2025-05-07T20:32:08.7332133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.7332792Z self=, 2025-05-07T20:32:08.7333435Z T=16384, 2025-05-07T20:32:08.7333722Z D=7168, 2025-05-07T20:32:08.7334403Z scale_ub=None, 2025-05-07T20:32:08.7334720Z contiguous=True, 2025-05-07T20:32:08.7335052Z compiled=True, 2025-05-07T20:32:08.7335357Z ) 2025-05-07T20:32:08.7335816Z self = 2025-05-07T20:32:08.7336527Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.7336823Z 2025-05-07T20:32:08.7336919Z @given( 2025-05-07T20:32:08.7337165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.7337507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.7337842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.7338194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.7338553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.7338863Z ) 2025-05-07T20:32:08.7339239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.7340356Z def test_silu_mul_quant( 2025-05-07T20:32:08.7340627Z self, 2025-05-07T20:32:08.7340841Z T: int, 2025-05-07T20:32:08.7341053Z D: int, 2025-05-07T20:32:08.7341291Z scale_ub: Optional[float], 2025-05-07T20:32:08.7341587Z contiguous: bool, 2025-05-07T20:32:08.7341842Z compiled: bool, 2025-05-07T20:32:08.7342092Z ) -> None: 2025-05-07T20:32:08.7342333Z torch.manual_seed(2025) 2025-05-07T20:32:08.7342592Z 2025-05-07T20:32:08.7342891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.7343264Z 2025-05-07T20:32:08.7343471Z x_sign = torch.sign(x) 2025-05-07T20:32:08.7343790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.7344126Z x = x_sign * x_clamp 2025-05-07T20:32:08.7344382Z x0 = x[:, :D] 2025-05-07T20:32:08.7344622Z x1 = x[:, D:] 2025-05-07T20:32:08.7344854Z 2025-05-07T20:32:08.7345063Z if contiguous: 2025-05-07T20:32:08.7345324Z x0 = x0.contiguous() 2025-05-07T20:32:08.7345612Z x1 = x1.contiguous() 2025-05-07T20:32:08.7345877Z 2025-05-07T20:32:08.7346085Z if scale_ub is not None: 2025-05-07T20:32:08.7346384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.7346751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.7347084Z ) 2025-05-07T20:32:08.7347303Z else: 2025-05-07T20:32:08.7347538Z scale_ub_tensor = None 2025-05-07T20:32:08.7347831Z 2025-05-07T20:32:08.7348109Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.7348451Z op = silu_mul_quant 2025-05-07T20:32:08.7348717Z if compiled: 2025-05-07T20:32:08.7348992Z op = torch.compile(op) 2025-05-07T20:32:08.7349320Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7349613Z 2025-05-07T20:32:08.7349829Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.7350022Z 2025-05-07T20:32:08.7350158Z moe/activation_test.py:117: 2025-05-07T20:32:08.7350484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7350847Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.7351145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7351745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.7352338Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.7353029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.7353757Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.7354335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.7355059Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.7355856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.7356428Z kernel = self.compile( 2025-05-07T20:32:08.7357008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.7357700Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.7358133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7358385Z 2025-05-07T20:32:08.7358606Z self = 2025-05-07T20:32:08.7359746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.7361317Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f61c60>} 2025-05-07T20:32:08.7362795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.7363870Z context = 2025-05-07T20:32:08.7364184Z 2025-05-07T20:32:08.7364364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.7364926Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.7365421Z module_map=module_map) 2025-05-07T20:32:08.7365815Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.7366201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.7366478Z E ^ 2025-05-07T20:32:08.7366987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.7367465Z 2025-05-07T20:32:08.7367928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.7368494Z 2025-05-07T20:32:08.7368615Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.7369055Z self=, 2025-05-07T20:32:08.7369490Z T=4096, 2025-05-07T20:32:08.7369699Z D=5120, 2025-05-07T20:32:08.7369903Z scale_ub=None, 2025-05-07T20:32:08.7370141Z contiguous=False, 2025-05-07T20:32:08.7370388Z compiled=True, 2025-05-07T20:32:08.7370603Z ) 2025-05-07T20:32:08.7370948Z self = 2025-05-07T20:32:08.7371479Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.7371767Z 2025-05-07T20:32:08.7371866Z @given( 2025-05-07T20:32:08.7372115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.7372458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.7372790Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.7373141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.7373500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.7373813Z ) 2025-05-07T20:32:08.7374186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.7374659Z def test_silu_mul_quant( 2025-05-07T20:32:08.7374923Z self, 2025-05-07T20:32:08.7375130Z T: int, 2025-05-07T20:32:08.7375350Z D: int, 2025-05-07T20:32:08.7375592Z scale_ub: Optional[float], 2025-05-07T20:32:08.7375886Z contiguous: bool, 2025-05-07T20:32:08.7376139Z compiled: bool, 2025-05-07T20:32:08.7376382Z ) -> None: 2025-05-07T20:32:08.7376617Z torch.manual_seed(2025) 2025-05-07T20:32:08.7376878Z 2025-05-07T20:32:08.7377258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.7377634Z 2025-05-07T20:32:08.7377840Z x_sign = torch.sign(x) 2025-05-07T20:32:08.7378154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.7378490Z x = x_sign * x_clamp 2025-05-07T20:32:08.7378744Z x0 = x[:, :D] 2025-05-07T20:32:08.7378982Z x1 = x[:, D:] 2025-05-07T20:32:08.7379211Z 2025-05-07T20:32:08.7379411Z if contiguous: 2025-05-07T20:32:08.7379663Z x0 = x0.contiguous() 2025-05-07T20:32:08.7379945Z x1 = x1.contiguous() 2025-05-07T20:32:08.7380201Z 2025-05-07T20:32:08.7380414Z if scale_ub is not None: 2025-05-07T20:32:08.7380713Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.7381068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.7381410Z ) 2025-05-07T20:32:08.7381711Z else: 2025-05-07T20:32:08.7381940Z scale_ub_tensor = None 2025-05-07T20:32:08.7382209Z 2025-05-07T20:32:08.7382458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.7382794Z op = silu_mul_quant 2025-05-07T20:32:08.7383057Z if compiled: 2025-05-07T20:32:08.7383322Z op = torch.compile(op) 2025-05-07T20:32:08.7383640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7383932Z 2025-05-07T20:32:08.7384141Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.7384313Z 2025-05-07T20:32:08.7384425Z moe/activation_test.py:117: 2025-05-07T20:32:08.7384736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7385090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.7385391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7385980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.7386566Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.7387268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.7387992Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.7388552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.7389268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.7389969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.7390535Z kernel = self.compile( 2025-05-07T20:32:08.7391111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.7391807Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.7392237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7392493Z 2025-05-07T20:32:08.7392712Z self = 2025-05-07T20:32:08.7393838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.7395268Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f62980>} 2025-05-07T20:32:08.7396666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.7397727Z context = 2025-05-07T20:32:08.7398042Z 2025-05-07T20:32:08.7398306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.7398864Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.7399361Z module_map=module_map) 2025-05-07T20:32:08.7399744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.7400202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.7400482Z E ^ 2025-05-07T20:32:08.7400971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.7401446Z 2025-05-07T20:32:08.7401880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.8755663Z 2025-05-07T20:32:08.8756098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.8756799Z self=, 2025-05-07T20:32:08.8757668Z T=4096, 2025-05-07T20:32:08.8758068Z D=5120, 2025-05-07T20:32:08.8758457Z scale_ub=1200.0, 2025-05-07T20:32:08.8758922Z contiguous=False, 2025-05-07T20:32:08.8759389Z compiled=False, 2025-05-07T20:32:08.8759808Z ) 2025-05-07T20:32:08.8760592Z self = 2025-05-07T20:32:08.8761633Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.8762205Z 2025-05-07T20:32:08.8762365Z @given( 2025-05-07T20:32:08.8762840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.8763489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.8764119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.8764842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.8765524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.8766121Z ) 2025-05-07T20:32:08.8766871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.8767731Z def test_silu_mul_quant( 2025-05-07T20:32:08.8768028Z self, 2025-05-07T20:32:08.8768257Z T: int, 2025-05-07T20:32:08.8776755Z D: int, 2025-05-07T20:32:08.8777009Z scale_ub: Optional[float], 2025-05-07T20:32:08.8777309Z contiguous: bool, 2025-05-07T20:32:08.8777565Z compiled: bool, 2025-05-07T20:32:08.8777816Z ) -> None: 2025-05-07T20:32:08.8778057Z torch.manual_seed(2025) 2025-05-07T20:32:08.8778316Z 2025-05-07T20:32:08.8778617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.8778988Z 2025-05-07T20:32:08.8779199Z x_sign = torch.sign(x) 2025-05-07T20:32:08.8779507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.8779845Z x = x_sign * x_clamp 2025-05-07T20:32:08.8780103Z x0 = x[:, :D] 2025-05-07T20:32:08.8780342Z x1 = x[:, D:] 2025-05-07T20:32:08.8780566Z 2025-05-07T20:32:08.8780774Z if contiguous: 2025-05-07T20:32:08.8781019Z x0 = x0.contiguous() 2025-05-07T20:32:08.8781300Z x1 = x1.contiguous() 2025-05-07T20:32:08.8781558Z 2025-05-07T20:32:08.8781759Z if scale_ub is not None: 2025-05-07T20:32:08.8782056Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.8782420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.8782744Z ) 2025-05-07T20:32:08.8782960Z else: 2025-05-07T20:32:08.8783187Z scale_ub_tensor = None 2025-05-07T20:32:08.8783452Z 2025-05-07T20:32:08.8783705Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.8784040Z op = silu_mul_quant 2025-05-07T20:32:08.8784305Z if compiled: 2025-05-07T20:32:08.8784564Z op = torch.compile(op) 2025-05-07T20:32:08.8784882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8785183Z 2025-05-07T20:32:08.8785382Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.8785779Z 2025-05-07T20:32:08.8785891Z moe/activation_test.py:117: 2025-05-07T20:32:08.8786207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8786555Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.8786855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8787585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.8788307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.8788867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.8789584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.8790287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.8790924Z kernel = self.compile( 2025-05-07T20:32:08.8791502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.8792190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.8792613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8792852Z 2025-05-07T20:32:08.8793070Z self = 2025-05-07T20:32:08.8794202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.8795658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f63ba0>} 2025-05-07T20:32:08.8797076Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.8798201Z context = 2025-05-07T20:32:08.8798503Z 2025-05-07T20:32:08.8798681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.8799230Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.8799725Z module_map=module_map) 2025-05-07T20:32:08.8800171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.8800552Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.8800831Z E ^ 2025-05-07T20:32:08.8801317Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.8801802Z 2025-05-07T20:32:08.8802244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.8802786Z 2025-05-07T20:32:08.8802896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.8803332Z self=, 2025-05-07T20:32:08.8803751Z T=4096, 2025-05-07T20:32:08.8803956Z D=5120, 2025-05-07T20:32:08.8804162Z scale_ub=1200.0, 2025-05-07T20:32:08.8804396Z contiguous=False, 2025-05-07T20:32:08.8804642Z compiled=True, 2025-05-07T20:32:08.8804860Z ) 2025-05-07T20:32:08.8805201Z self = 2025-05-07T20:32:08.8805716Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:08.8806010Z 2025-05-07T20:32:08.8806091Z @given( 2025-05-07T20:32:08.8806340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.8806669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.8807087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.8807442Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.8807784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.8808088Z ) 2025-05-07T20:32:08.8808458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.8808922Z def test_silu_mul_quant( 2025-05-07T20:32:08.8809171Z self, 2025-05-07T20:32:08.8809381Z T: int, 2025-05-07T20:32:08.8809592Z D: int, 2025-05-07T20:32:08.8809818Z scale_ub: Optional[float], 2025-05-07T20:32:08.8810105Z contiguous: bool, 2025-05-07T20:32:08.8810358Z compiled: bool, 2025-05-07T20:32:08.8810588Z ) -> None: 2025-05-07T20:32:08.8810818Z torch.manual_seed(2025) 2025-05-07T20:32:08.8811074Z 2025-05-07T20:32:08.8811360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.8811846Z 2025-05-07T20:32:08.8812058Z x_sign = torch.sign(x) 2025-05-07T20:32:08.8812360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.8812685Z x = x_sign * x_clamp 2025-05-07T20:32:08.8812938Z x0 = x[:, :D] 2025-05-07T20:32:08.8813165Z x1 = x[:, D:] 2025-05-07T20:32:08.8813716Z 2025-05-07T20:32:08.8813918Z if contiguous: 2025-05-07T20:32:08.8814156Z x0 = x0.contiguous() 2025-05-07T20:32:08.8814433Z x1 = x1.contiguous() 2025-05-07T20:32:08.8814688Z 2025-05-07T20:32:08.8814900Z if scale_ub is not None: 2025-05-07T20:32:08.8815188Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.8815546Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.8815872Z ) 2025-05-07T20:32:08.8816071Z else: 2025-05-07T20:32:08.8816298Z scale_ub_tensor = None 2025-05-07T20:32:08.8816571Z 2025-05-07T20:32:08.8816814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.8817153Z op = silu_mul_quant 2025-05-07T20:32:08.8817422Z if compiled: 2025-05-07T20:32:08.8817680Z op = torch.compile(op) 2025-05-07T20:32:08.8817994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8818288Z 2025-05-07T20:32:08.8818486Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.8818669Z 2025-05-07T20:32:08.8818772Z moe/activation_test.py:117: 2025-05-07T20:32:08.8819087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8819439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.8819732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8820318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.8820902Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.8821586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.8822314Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.8822876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.8823586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.8824275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.8824837Z kernel = self.compile( 2025-05-07T20:32:08.8825405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.8826082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.8826501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8826746Z 2025-05-07T20:32:08.8826970Z self = 2025-05-07T20:32:08.8828284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.8829718Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2130ea0>} 2025-05-07T20:32:08.8831107Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.8832173Z context = 2025-05-07T20:32:08.8832480Z 2025-05-07T20:32:08.8832655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.8833334Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.8833817Z module_map=module_map) 2025-05-07T20:32:08.8834205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.8834582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.8834849Z E ^ 2025-05-07T20:32:08.8835338Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.8835811Z 2025-05-07T20:32:08.8836244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.8836774Z 2025-05-07T20:32:08.8836889Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.8837319Z self=, 2025-05-07T20:32:08.8837745Z T=2048, 2025-05-07T20:32:08.8837949Z D=7168, 2025-05-07T20:32:08.8838181Z scale_ub=1200.0, 2025-05-07T20:32:08.8838446Z contiguous=False, 2025-05-07T20:32:08.8838693Z compiled=False, 2025-05-07T20:32:09.0771533Z ) 2025-05-07T20:32:09.0772010Z self = 2025-05-07T20:32:09.0772816Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.0773219Z 2025-05-07T20:32:09.0773338Z @given( 2025-05-07T20:32:09.0773597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0773926Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0774260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0774612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0774956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0775265Z ) 2025-05-07T20:32:09.0775637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0776127Z def test_silu_mul_quant( 2025-05-07T20:32:09.0776392Z self, 2025-05-07T20:32:09.0776610Z T: int, 2025-05-07T20:32:09.0776814Z D: int, 2025-05-07T20:32:09.0777047Z scale_ub: Optional[float], 2025-05-07T20:32:09.0777333Z contiguous: bool, 2025-05-07T20:32:09.0777591Z compiled: bool, 2025-05-07T20:32:09.0777839Z ) -> None: 2025-05-07T20:32:09.0778099Z torch.manual_seed(2025) 2025-05-07T20:32:09.0778355Z 2025-05-07T20:32:09.0778637Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0778998Z 2025-05-07T20:32:09.0779203Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0779506Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0779835Z x = x_sign * x_clamp 2025-05-07T20:32:09.0780092Z x0 = x[:, :D] 2025-05-07T20:32:09.0780315Z x1 = x[:, D:] 2025-05-07T20:32:09.0780537Z 2025-05-07T20:32:09.0780737Z if contiguous: 2025-05-07T20:32:09.0780982Z x0 = x0.contiguous() 2025-05-07T20:32:09.0781604Z x1 = x1.contiguous() 2025-05-07T20:32:09.0781864Z 2025-05-07T20:32:09.0782065Z if scale_ub is not None: 2025-05-07T20:32:09.0782355Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0782711Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0783039Z ) 2025-05-07T20:32:09.0783239Z else: 2025-05-07T20:32:09.0783464Z scale_ub_tensor = None 2025-05-07T20:32:09.0783730Z 2025-05-07T20:32:09.0783969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0784304Z op = silu_mul_quant 2025-05-07T20:32:09.0784570Z if compiled: 2025-05-07T20:32:09.0784827Z op = torch.compile(op) 2025-05-07T20:32:09.0785143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0785442Z 2025-05-07T20:32:09.0785641Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0785823Z 2025-05-07T20:32:09.0786124Z moe/activation_test.py:117: 2025-05-07T20:32:09.0786448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0786795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0787097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0787830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:09.0788556Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0789121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0789842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0790544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0791112Z kernel = self.compile( 2025-05-07T20:32:09.0791681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0792385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0792809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0793049Z 2025-05-07T20:32:09.0793272Z self = 2025-05-07T20:32:09.0794406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0795862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2131940>} 2025-05-07T20:32:09.0797266Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0798341Z context = 2025-05-07T20:32:09.0798644Z 2025-05-07T20:32:09.0798824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0799378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0799871Z module_map=module_map) 2025-05-07T20:32:09.0800354Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0800725Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0800999Z E ^ 2025-05-07T20:32:09.0801485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0801955Z 2025-05-07T20:32:09.0802390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0802935Z 2025-05-07T20:32:09.0803133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0803578Z self=, 2025-05-07T20:32:09.0804007Z T=1, 2025-05-07T20:32:09.0804197Z D=7168, 2025-05-07T20:32:09.0804403Z scale_ub=None, 2025-05-07T20:32:09.0804628Z contiguous=True, 2025-05-07T20:32:09.0804860Z compiled=False, 2025-05-07T20:32:09.0805076Z ) 2025-05-07T20:32:09.0805412Z self = 2025-05-07T20:32:09.0805920Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:09.0806199Z 2025-05-07T20:32:09.0806281Z @given( 2025-05-07T20:32:09.0806526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0806853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0807177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0807611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0807967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0808269Z ) 2025-05-07T20:32:09.0808637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0809104Z def test_silu_mul_quant( 2025-05-07T20:32:09.0809357Z self, 2025-05-07T20:32:09.0809562Z T: int, 2025-05-07T20:32:09.0809775Z D: int, 2025-05-07T20:32:09.0810001Z scale_ub: Optional[float], 2025-05-07T20:32:09.0810288Z contiguous: bool, 2025-05-07T20:32:09.0810547Z compiled: bool, 2025-05-07T20:32:09.0810782Z ) -> None: 2025-05-07T20:32:09.0811014Z torch.manual_seed(2025) 2025-05-07T20:32:09.0811275Z 2025-05-07T20:32:09.0811564Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0811928Z 2025-05-07T20:32:09.0812139Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0812452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0812787Z x = x_sign * x_clamp 2025-05-07T20:32:09.0813043Z x0 = x[:, :D] 2025-05-07T20:32:09.0813274Z x1 = x[:, D:] 2025-05-07T20:32:09.0813800Z 2025-05-07T20:32:09.0813995Z if contiguous: 2025-05-07T20:32:09.0814239Z x0 = x0.contiguous() 2025-05-07T20:32:09.0814506Z x1 = x1.contiguous() 2025-05-07T20:32:09.0814760Z 2025-05-07T20:32:09.0814961Z if scale_ub is not None: 2025-05-07T20:32:09.0815242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0815596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0815921Z ) 2025-05-07T20:32:09.0816121Z else: 2025-05-07T20:32:09.0816345Z scale_ub_tensor = None 2025-05-07T20:32:09.0816610Z 2025-05-07T20:32:09.0816847Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0817179Z op = silu_mul_quant 2025-05-07T20:32:09.0817452Z if compiled: 2025-05-07T20:32:09.0817719Z op = torch.compile(op) 2025-05-07T20:32:09.0818086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0818379Z 2025-05-07T20:32:09.0818583Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0818755Z 2025-05-07T20:32:09.0818859Z moe/activation_test.py:117: 2025-05-07T20:32:09.0819174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0819527Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0819821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0820547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0821268Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0821838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0822560Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0823392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0823959Z kernel = self.compile( 2025-05-07T20:32:09.0824527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0825217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0825641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0825888Z 2025-05-07T20:32:09.0826112Z self = 2025-05-07T20:32:09.0827238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0828798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2132ca0>} 2025-05-07T20:32:09.0830202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0831274Z context = 2025-05-07T20:32:09.0831578Z 2025-05-07T20:32:09.0831762Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0832312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0832807Z module_map=module_map) 2025-05-07T20:32:09.0833191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0833562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0833845Z E ^ 2025-05-07T20:32:09.0834345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0834817Z 2025-05-07T20:32:09.0835262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0835800Z 2025-05-07T20:32:09.0835911Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0836355Z self=, 2025-05-07T20:32:09.0836781Z T=16384, 2025-05-07T20:32:09.0836982Z D=7168, 2025-05-07T20:32:09.0837189Z scale_ub=1200.0, 2025-05-07T20:32:09.0837429Z contiguous=False, 2025-05-07T20:32:09.0837665Z compiled=True, 2025-05-07T20:32:09.0837881Z ) 2025-05-07T20:32:09.0838218Z self = 2025-05-07T20:32:09.0838748Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.0839052Z 2025-05-07T20:32:09.0839138Z @given( 2025-05-07T20:32:09.0839386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0839722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0840047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0840476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0840828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0841129Z ) 2025-05-07T20:32:09.0841501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0841968Z def test_silu_mul_quant( 2025-05-07T20:32:09.0842220Z self, 2025-05-07T20:32:09.0842428Z T: int, 2025-05-07T20:32:09.0842637Z D: int, 2025-05-07T20:32:09.0842869Z scale_ub: Optional[float], 2025-05-07T20:32:09.0843152Z contiguous: bool, 2025-05-07T20:32:09.0843406Z compiled: bool, 2025-05-07T20:32:09.0843648Z ) -> None: 2025-05-07T20:32:09.0843961Z torch.manual_seed(2025) 2025-05-07T20:32:09.0844219Z 2025-05-07T20:32:09.0844506Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0844860Z 2025-05-07T20:32:09.0845067Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0845381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0845704Z x = x_sign * x_clamp 2025-05-07T20:32:09.0845962Z x0 = x[:, :D] 2025-05-07T20:32:09.0846199Z x1 = x[:, D:] 2025-05-07T20:32:09.0846414Z 2025-05-07T20:32:09.0846614Z if contiguous: 2025-05-07T20:32:09.0846860Z x0 = x0.contiguous() 2025-05-07T20:32:09.0847129Z x1 = x1.contiguous() 2025-05-07T20:32:09.0847389Z 2025-05-07T20:32:09.0847593Z if scale_ub is not None: 2025-05-07T20:32:09.0847887Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0848238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0848653Z ) 2025-05-07T20:32:09.0848871Z else: 2025-05-07T20:32:09.0849094Z scale_ub_tensor = None 2025-05-07T20:32:09.0849363Z 2025-05-07T20:32:09.0849615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0849942Z op = silu_mul_quant 2025-05-07T20:32:09.0850210Z if compiled: 2025-05-07T20:32:09.0850473Z op = torch.compile(op) 2025-05-07T20:32:09.0850779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0851069Z 2025-05-07T20:32:09.0851273Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0851445Z 2025-05-07T20:32:09.0851549Z moe/activation_test.py:117: 2025-05-07T20:32:09.0851861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0852210Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0852510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0853096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.0853688Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.0854380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0855094Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0855657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0856371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0857069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0857621Z kernel = self.compile( 2025-05-07T20:32:09.0858258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0858950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0859388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0859633Z 2025-05-07T20:32:09.0859857Z self = 2025-05-07T20:32:09.0860979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0862412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2133f60>} 2025-05-07T20:32:09.0863815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0864900Z context = 2025-05-07T20:32:09.0874077Z 2025-05-07T20:32:09.0874300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0874869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0875375Z module_map=module_map) 2025-05-07T20:32:09.0875771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0876146Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0876426Z E ^ 2025-05-07T20:32:09.0876930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0877405Z 2025-05-07T20:32:09.0877877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.2163862Z 2025-05-07T20:32:09.2164141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.2164994Z self=, 2025-05-07T20:32:09.2165552Z T=1, 2025-05-07T20:32:09.2165745Z D=7168, 2025-05-07T20:32:09.2165958Z scale_ub=None, 2025-05-07T20:32:09.2166192Z contiguous=False, 2025-05-07T20:32:09.2166429Z compiled=False, 2025-05-07T20:32:09.2166657Z ) 2025-05-07T20:32:09.2166998Z self = 2025-05-07T20:32:09.2167505Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:09.2167785Z 2025-05-07T20:32:09.2167868Z @given( 2025-05-07T20:32:09.2168117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.2168454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.2168776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.2169128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.2169479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.2169785Z ) 2025-05-07T20:32:09.2170161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.2170630Z def test_silu_mul_quant( 2025-05-07T20:32:09.2170881Z self, 2025-05-07T20:32:09.2171092Z T: int, 2025-05-07T20:32:09.2171306Z D: int, 2025-05-07T20:32:09.2171535Z scale_ub: Optional[float], 2025-05-07T20:32:09.2171826Z contiguous: bool, 2025-05-07T20:32:09.2172086Z compiled: bool, 2025-05-07T20:32:09.2172326Z ) -> None: 2025-05-07T20:32:09.2172562Z torch.manual_seed(2025) 2025-05-07T20:32:09.2172825Z 2025-05-07T20:32:09.2173111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.2173477Z 2025-05-07T20:32:09.2173690Z x_sign = torch.sign(x) 2025-05-07T20:32:09.2174004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.2174328Z x = x_sign * x_clamp 2025-05-07T20:32:09.2174598Z x0 = x[:, :D] 2025-05-07T20:32:09.2174836Z x1 = x[:, D:] 2025-05-07T20:32:09.2175057Z 2025-05-07T20:32:09.2175265Z if contiguous: 2025-05-07T20:32:09.2175510Z x0 = x0.contiguous() 2025-05-07T20:32:09.2175780Z x1 = x1.contiguous() 2025-05-07T20:32:09.2176044Z 2025-05-07T20:32:09.2176258Z if scale_ub is not None: 2025-05-07T20:32:09.2176540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.2176902Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.2177234Z ) 2025-05-07T20:32:09.2177468Z else: 2025-05-07T20:32:09.2177701Z scale_ub_tensor = None 2025-05-07T20:32:09.2177964Z 2025-05-07T20:32:09.2178214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.2178547Z op = silu_mul_quant 2025-05-07T20:32:09.2178809Z if compiled: 2025-05-07T20:32:09.2179073Z op = torch.compile(op) 2025-05-07T20:32:09.2179398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.2179846Z 2025-05-07T20:32:09.2180058Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.2180239Z 2025-05-07T20:32:09.2180346Z moe/activation_test.py:117: 2025-05-07T20:32:09.2180662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.2181014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.2181314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.2182037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.2182751Z 
[The identical CompilationError repeats for every further example Hypothesis draws; the duplicated test source and tracebacks are elided below, one line per attempt. For compiled=True the call chain additionally passes through torch/_dynamo/eval_frame.py:678 before reaching the same kernel launch.]
2025-05-07T20:32:09.2197243Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.2232584Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.4486884Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
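For orientation, the op under test consumes the two D-wide halves x0 and x1 and returns a quantized tensor plus scales. A plausible eager-mode stand-in, a sketch only (activation.py itself is not shown in this log, and the rowwise scaling granularity is an assumption):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32, then rowwise symmetric quantization to e4m3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / torch.finfo(torch.float8_e4m3fn).max  # e4m3 max is 448.0
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The contiguous sweep in the test exists because the kernel must handle the strided views x[:, :D] and x[:, D:] as well as compacted copies of them.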
2025-05-07T20:32:09.4520055Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:09.6074406Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.6110509Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.7716594Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
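Since the failure is a property of the device rather than of any drawn example, Hypothesis keeps retrying and hitting the identical error. One illustrative fix, not the actual FBGEMM test code (the class name and capability threshold are assumptions), is to gate the whole case on device capability:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) lowering is assumed to require an
        # NVIDIA GPU with compute capability 8.9 or newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With a gate like this, the job would report skips here instead of burning through every Hypothesis example. The remaining attempts from this run follow.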
2025-05-07T20:32:09.7735821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.7736548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.7737112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.7737836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.7738541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.7739107Z kernel = self.compile( 2025-05-07T20:32:09.7739676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.7740369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.7740793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7741042Z 2025-05-07T20:32:09.7741265Z self = 2025-05-07T20:32:09.7742406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.7743871Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1534c20>} 2025-05-07T20:32:09.7745288Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.7746366Z context = 2025-05-07T20:32:09.7746674Z 2025-05-07T20:32:09.7746849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.7747485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.7747985Z module_map=module_map) 2025-05-07T20:32:09.7748370Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.7748739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.7749015Z E ^ 2025-05-07T20:32:09.7749511Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7749985Z 2025-05-07T20:32:09.7750426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.7750969Z 2025-05-07T20:32:09.7751078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.7751514Z self=, 2025-05-07T20:32:09.7752055Z T=4096, 2025-05-07T20:32:09.7752249Z D=5120, 2025-05-07T20:32:09.7752457Z scale_ub=1200.0, 2025-05-07T20:32:09.7752693Z contiguous=True, 2025-05-07T20:32:09.7752919Z compiled=True, 2025-05-07T20:32:09.7753131Z ) 2025-05-07T20:32:09.7753467Z self = 2025-05-07T20:32:09.7753982Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:09.7754274Z 2025-05-07T20:32:09.7754357Z @given( 2025-05-07T20:32:09.7754602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.7754927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.7755251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.7755601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.7755947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.7756242Z ) 2025-05-07T20:32:09.7756611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.7757088Z def test_silu_mul_quant( 2025-05-07T20:32:09.7757337Z self, 2025-05-07T20:32:09.7757541Z T: int, 2025-05-07T20:32:09.7757756Z D: int, 2025-05-07T20:32:09.7758002Z scale_ub: Optional[float], 2025-05-07T20:32:09.7758319Z contiguous: bool, 2025-05-07T20:32:09.7758575Z compiled: bool, 2025-05-07T20:32:09.7758802Z ) -> None: 2025-05-07T20:32:09.7759031Z torch.manual_seed(2025) 2025-05-07T20:32:09.7759287Z 2025-05-07T20:32:09.7759569Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.7759933Z 2025-05-07T20:32:09.7760223Z x_sign = torch.sign(x) 2025-05-07T20:32:09.7760533Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.7760863Z x = x_sign * x_clamp 2025-05-07T20:32:09.7761117Z x0 = x[:, :D] 2025-05-07T20:32:09.7761348Z x1 = x[:, D:] 2025-05-07T20:32:09.7761568Z 2025-05-07T20:32:09.7761763Z if contiguous: 2025-05-07T20:32:09.7762008Z x0 = x0.contiguous() 2025-05-07T20:32:09.7762277Z x1 = x1.contiguous() 2025-05-07T20:32:09.7762529Z 2025-05-07T20:32:09.7762728Z if scale_ub is not None: 2025-05-07T20:32:09.7763009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.7763363Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.7763686Z ) 2025-05-07T20:32:09.7763884Z else: 2025-05-07T20:32:09.7764104Z scale_ub_tensor = None 2025-05-07T20:32:09.7764370Z 2025-05-07T20:32:09.7764607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.7764938Z op = silu_mul_quant 2025-05-07T20:32:09.7765205Z if compiled: 2025-05-07T20:32:09.7765466Z op = torch.compile(op) 2025-05-07T20:32:09.7765776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.7766068Z 2025-05-07T20:32:09.7766291Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.7766468Z 2025-05-07T20:32:09.7766670Z moe/activation_test.py:117: 2025-05-07T20:32:09.7766978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7767336Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.7767634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.7768219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.7768815Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.7769514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.7770240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.7779412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.7780196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.7781034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.7781605Z kernel = self.compile( 2025-05-07T20:32:09.7782179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.7782875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.7783302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7783546Z 2025-05-07T20:32:09.7783765Z self = 2025-05-07T20:32:09.7784900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.7786356Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1535a80>} 2025-05-07T20:32:09.7787772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.7788847Z context = 2025-05-07T20:32:09.7789154Z 2025-05-07T20:32:09.7789329Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.7789882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.7790377Z module_map=module_map) 2025-05-07T20:32:09.7790765Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.7791138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.7791419Z E ^ 2025-05-07T20:32:09.7791923Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7792394Z 2025-05-07T20:32:09.7792838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.9453402Z 2025-05-07T20:32:09.9453770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.9454268Z self=, 2025-05-07T20:32:09.9454692Z T=128, 2025-05-07T20:32:09.9454898Z D=5120, 2025-05-07T20:32:09.9455111Z scale_ub=1200.0, 2025-05-07T20:32:09.9455347Z contiguous=False, 2025-05-07T20:32:09.9455597Z compiled=True, 2025-05-07T20:32:09.9455820Z ) 2025-05-07T20:32:09.9456159Z self = 2025-05-07T20:32:09.9456682Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.9456998Z 2025-05-07T20:32:09.9457081Z @given( 2025-05-07T20:32:09.9457682Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.9458013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.9458345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.9458701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.9459042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.9459350Z ) 2025-05-07T20:32:09.9459725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.9460196Z def test_silu_mul_quant( 2025-05-07T20:32:09.9460448Z self, 2025-05-07T20:32:09.9460661Z T: int, 2025-05-07T20:32:09.9460873Z D: int, 2025-05-07T20:32:09.9461100Z scale_ub: Optional[float], 2025-05-07T20:32:09.9461389Z contiguous: bool, 2025-05-07T20:32:09.9461654Z compiled: bool, 2025-05-07T20:32:09.9461893Z ) -> None: 2025-05-07T20:32:09.9462292Z torch.manual_seed(2025) 2025-05-07T20:32:09.9462555Z 2025-05-07T20:32:09.9462846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.9463209Z 2025-05-07T20:32:09.9463423Z x_sign = torch.sign(x) 2025-05-07T20:32:09.9463727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.9464062Z x = x_sign * x_clamp 2025-05-07T20:32:09.9464318Z x0 = x[:, :D] 2025-05-07T20:32:09.9464545Z x1 = x[:, D:] 2025-05-07T20:32:09.9464772Z 2025-05-07T20:32:09.9464972Z if contiguous: 2025-05-07T20:32:09.9465211Z x0 = x0.contiguous() 2025-05-07T20:32:09.9465487Z x1 = x1.contiguous() 2025-05-07T20:32:09.9465745Z 2025-05-07T20:32:09.9465946Z if scale_ub is not None: 2025-05-07T20:32:09.9466240Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.9466600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.9466938Z ) 2025-05-07T20:32:09.9467140Z else: 2025-05-07T20:32:09.9467378Z scale_ub_tensor = None 2025-05-07T20:32:09.9467646Z 2025-05-07T20:32:09.9467888Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.9468227Z op = silu_mul_quant 2025-05-07T20:32:09.9468492Z if compiled: 2025-05-07T20:32:09.9468751Z op = torch.compile(op) 2025-05-07T20:32:09.9469068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9469363Z 2025-05-07T20:32:09.9469566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.9469747Z 2025-05-07T20:32:09.9469855Z moe/activation_test.py:117: 2025-05-07T20:32:09.9470166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9470543Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.9470851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9471438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.9472034Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.9472729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.9473441Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.9474005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.9474715Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.9475408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.9475964Z kernel = self.compile( 2025-05-07T20:32:09.9476530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.9477215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.9477724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9477977Z 2025-05-07T20:32:09.9478229Z self = 2025-05-07T20:32:09.9479372Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.9481053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1536ca0>} 2025-05-07T20:32:09.9482462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.9483532Z context = 2025-05-07T20:32:09.9483928Z 2025-05-07T20:32:09.9484114Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.9484674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.9485177Z module_map=module_map) 2025-05-07T20:32:09.9485566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9485950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9486235Z E ^ 2025-05-07T20:32:09.9486732Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.9487207Z 2025-05-07T20:32:09.9487647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.9488198Z 2025-05-07T20:32:09.9488311Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.9488763Z self=, 2025-05-07T20:32:09.9489200Z T=16384, 2025-05-07T20:32:09.9489408Z D=7168, 2025-05-07T20:32:09.9489623Z scale_ub=1200.0, 2025-05-07T20:32:09.9489861Z contiguous=True, 2025-05-07T20:32:09.9490093Z compiled=True, 2025-05-07T20:32:09.9490313Z ) 2025-05-07T20:32:09.9490657Z self = 2025-05-07T20:32:09.9491180Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:09.9491484Z 2025-05-07T20:32:09.9491568Z @given( 2025-05-07T20:32:09.9491821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.9492157Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.9492491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.9492846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.9493201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.9493512Z ) 2025-05-07T20:32:09.9493892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.9494368Z def test_silu_mul_quant( 2025-05-07T20:32:09.9494623Z self, 2025-05-07T20:32:09.9494836Z T: int, 2025-05-07T20:32:09.9495050Z D: int, 2025-05-07T20:32:09.9495281Z scale_ub: Optional[float], 2025-05-07T20:32:09.9495572Z contiguous: bool, 2025-05-07T20:32:09.9495829Z compiled: bool, 2025-05-07T20:32:09.9496060Z ) -> None: 2025-05-07T20:32:09.9496288Z torch.manual_seed(2025) 2025-05-07T20:32:09.9496547Z 2025-05-07T20:32:09.9496833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.9497193Z 2025-05-07T20:32:09.9497401Z x_sign = torch.sign(x) 2025-05-07T20:32:09.9497705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.9498038Z x = x_sign * x_clamp 2025-05-07T20:32:09.9498294Z x0 = x[:, :D] 2025-05-07T20:32:09.9498532Z x1 = x[:, D:] 2025-05-07T20:32:09.9498750Z 2025-05-07T20:32:09.9499047Z if contiguous: 2025-05-07T20:32:09.9499297Z x0 = x0.contiguous() 2025-05-07T20:32:09.9499569Z x1 = x1.contiguous() 2025-05-07T20:32:09.9499827Z 2025-05-07T20:32:09.9500035Z if scale_ub is not None: 2025-05-07T20:32:09.9500325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.9500685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.9501015Z ) 2025-05-07T20:32:09.9501222Z else: 2025-05-07T20:32:09.9501451Z scale_ub_tensor = None 2025-05-07T20:32:09.9501722Z 2025-05-07T20:32:09.9501966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.9502305Z op = silu_mul_quant 2025-05-07T20:32:09.9502576Z if compiled: 2025-05-07T20:32:09.9502835Z op = torch.compile(op) 2025-05-07T20:32:09.9503155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9503572Z 2025-05-07T20:32:09.9503782Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.9503959Z 2025-05-07T20:32:09.9504063Z moe/activation_test.py:117: 2025-05-07T20:32:09.9504378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9504731Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.9505031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9505619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.9506202Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.9506889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.9507608Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.9508179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.9508902Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.9509603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.9510166Z kernel = self.compile( 2025-05-07T20:32:09.9510730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.9511423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.9511842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9512086Z 2025-05-07T20:32:09.9512308Z self = 2025-05-07T20:32:09.9513700Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.9515146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef14c8400>} 2025-05-07T20:32:09.9516551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.9517620Z context = 2025-05-07T20:32:09.9517925Z 2025-05-07T20:32:09.9518115Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.9518663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.9519171Z module_map=module_map) 2025-05-07T20:32:09.9519563Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9519938Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9520283Z E ^ 2025-05-07T20:32:09.9520930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.9521854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.0664643Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:10.0696005Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0697561Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:10.0728527Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0730231Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:10.2366642Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
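Every example above dies in Triton's front end before the kernel body is ever evaluated, so the test source quoted in the traceback is the only description of the op we get. For orientation, here is a minimal eager sketch of what `silu_mul_quant` appears to compute, inferred purely from the call shape in the test (`op(x0, x1, scale_ub_tensor)` returning `(y_fp8, y_scale)`); the row-wise scaling scheme and the `float8_e4m3fn` target are assumptions, not FBGEMM's actual kernel:

```python
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Eager sketch (assumed semantics): fp8-quantize silu(x0) * x1 row-wise."""
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # One scale per row, optionally capped by scale_ub as the test's
    # scale_ub_tensor argument suggests.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = torch.clamp(y / y_scale, -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)
```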
2025-05-07T20:32:10.2368220Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:10.2400580Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2402144Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:10.3700320Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
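The `fp8e4nv` in the ValueError is Triton's name for the e4m3 float8 format. This job runs on `linux.g5.4xlarge.nvidia.gpu` (NVIDIA A10G, compute capability 8.6), and Triton generally accepts `fp8e4nv` only on sm_89 (Ada) and newer, hence the compile-time rejection on every parameter combination. A guard along the following lines (hypothetical, not part of `moe/activation_test.py`) would skip rather than fail these cases on older GPUs:

```python
# Hypothetical guard, not from the FBGEMM suite: skip fp8 tests on GPUs that
# predate fp8e4nv (e4m3) support, which Triton ties to sm_89 and newer.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device should compile Triton fp8e4nv kernels."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)  # Ada (sm_89) and Hopper (sm_90) onward


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        # Placeholder body; the real fp8 tests would live here.
        self.assertTrue(supports_fp8e4nv())
```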
2025-05-07T20:32:10.3701873Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:10.3711245Z >           x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.3713607Z E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.3715687Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:10.3716036Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:10.3725533Z >           x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.3727607Z E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.3729673Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:10.3730004Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:10.3738400Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.3740677Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.3742739Z moe/activation_test.py:92: OutOfMemoryError
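From this point the failures alternate between the Triton compile error and CUDA OOM: each failing example allocates fresh `[T, 2 * D]` bfloat16 tensors, and after enough examples the 22 GiB A10G is exhausted, so even a 56 MiB request fails. The allocator's own hint from the message, plus an explicit cleanup between examples, would look roughly like this (a sketch; the environment variable must be set before the first CUDA allocation, and the per-example flush is an addition, not something the test currently does):

```python
# Sketch of the allocator's suggestion from the log, plus explicit cleanup
# between Hypothesis examples. Assumption: this module is imported before
# torch touches the GPU, since PYTORCH_CUDA_ALLOC_CONF is read at the first
# CUDA allocation.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch


def release_cuda_memory() -> None:
    """Drop dead references, then return cached blocks to the driver."""
    gc.collect()
    torch.cuda.empty_cache()
```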
2025-05-07T20:32:10.4964575Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:10.4974666Z >           x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.4976813Z E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.4978944Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:10.4979276Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:10.4998034Z >       x_sign = torch.sign(x)
2025-05-07T20:32:10.5000058Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.5002257Z moe/activation_test.py:94: OutOfMemoryError
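The remaining examples below reproduce the same `fp8e4nv` CompilationError for other parameter combinations. To replay one specific combination deterministically, rather than waiting for Hypothesis's search to hit it again, `@example` pins it as an always-run case; a standalone sketch (hypothetical, not from the FBGEMM suite):

```python
# Standalone sketch of pinning a failing parameter set with Hypothesis's
# @example decorator so it is always replayed as a regression case.
from hypothesis import example, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=16384, D=7168)  # the first combination seen to fail in this log
@settings(max_examples=5, deadline=None)
def test_shapes_are_positive(T: int, D: int) -> None:
    # Stand-in assertion; the real test body would call the op under test.
    assert T > 0 and D > 0


test_shapes_are_positive()
```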
2025-05-07T20:32:10.5002610Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:10.5033823Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.5035384Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:10.5066281Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.6176816Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:10.6209418Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.6211065Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:10.6219668Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.6221812Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:10.6223739Z 2025-05-07T20:32:10.6223865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:10.6224093Z 2025-05-07T20:32:10.6224203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.6224636Z self=, 2025-05-07T20:32:10.6225048Z T=1, 2025-05-07T20:32:10.6225246Z D=5120, 2025-05-07T20:32:10.6225448Z scale_ub=1200.0, 2025-05-07T20:32:10.6225677Z contiguous=True, 2025-05-07T20:32:10.6225910Z compiled=False, 2025-05-07T20:32:10.6226137Z ) 2025-05-07T20:32:10.6226597Z self = 2025-05-07T20:32:10.6227113Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:10.6227389Z 2025-05-07T20:32:10.6227475Z @given( 2025-05-07T20:32:10.6227714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6228043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6228372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6228721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6229065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6229372Z ) 2025-05-07T20:32:10.6229747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6230208Z def test_silu_mul_quant( 2025-05-07T20:32:10.6230469Z self, 2025-05-07T20:32:10.6230677Z T: int, 2025-05-07T20:32:10.6230879Z D: int, 2025-05-07T20:32:10.6231227Z scale_ub: Optional[float], 2025-05-07T20:32:10.6231523Z contiguous: bool, 2025-05-07T20:32:10.6231772Z compiled: bool, 2025-05-07T20:32:10.6232010Z ) -> None: 2025-05-07T20:32:10.6232240Z torch.manual_seed(2025) 2025-05-07T20:32:10.6232490Z 2025-05-07T20:32:10.6232779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6233141Z 2025-05-07T20:32:10.6233343Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6233659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6233988Z x = x_sign * x_clamp 2025-05-07T20:32:10.6234243Z x0 = x[:, :D] 2025-05-07T20:32:10.6234470Z x1 = x[:, D:] 2025-05-07T20:32:10.6234692Z 2025-05-07T20:32:10.6234888Z if contiguous: 2025-05-07T20:32:10.6235129Z x0 = x0.contiguous() 2025-05-07T20:32:10.6235405Z x1 = x1.contiguous() 2025-05-07T20:32:10.6235662Z 2025-05-07T20:32:10.6235867Z if scale_ub is not None: 2025-05-07T20:32:10.6236157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6236523Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6236853Z ) 2025-05-07T20:32:10.6237055Z else: 2025-05-07T20:32:10.6237283Z scale_ub_tensor = None 2025-05-07T20:32:10.6237550Z 2025-05-07T20:32:10.6237794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6238131Z op = silu_mul_quant 2025-05-07T20:32:10.6238397Z if compiled: 2025-05-07T20:32:10.6238656Z op = torch.compile(op) 2025-05-07T20:32:10.6238976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6239270Z 2025-05-07T20:32:10.6239470Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.6239648Z 2025-05-07T20:32:10.6239752Z moe/activation_test.py:117: 2025-05-07T20:32:10.6240064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6240514Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.6240812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6241534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.6242256Z 
_fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the first occurrence above: triton/runtime/jit.py:330 -> jit.py:623 in run -> triton/compiler/compiler.py:273 in compile -> src.make_ir(options, codegen_fns, module_map, context)]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6255734Z 2025-05-07T20:32:10.6256177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7078824Z 2025-05-07T20:32:10.7079239Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7079851Z self=, 2025-05-07T20:32:10.7080604Z T=2048, 2025-05-07T20:32:10.7080877Z D=5120, 2025-05-07T20:32:10.7081161Z scale_ub=None, 2025-05-07T20:32:10.7081383Z contiguous=True, 2025-05-07T20:32:10.7081624Z compiled=False, 2025-05-07T20:32:10.7081845Z ) 2025-05-07T20:32:10.7082171Z self = 2025-05-07T20:32:10.7082686Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:10.7082966Z 2025-05-07T20:32:10.7083056Z @given( 2025-05-07T20:32:10.7083299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7083618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7083939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7084286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7084626Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7084932Z ) 2025-05-07T20:32:10.7085305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7085764Z def test_silu_mul_quant( 2025-05-07T20:32:10.7086057Z self, 2025-05-07T20:32:10.7086258Z T: int, 2025-05-07T20:32:10.7086465Z D: int, 2025-05-07T20:32:10.7086697Z scale_ub: Optional[float], 2025-05-07T20:32:10.7086977Z contiguous: bool, 2025-05-07T20:32:10.7087231Z compiled: bool, 2025-05-07T20:32:10.7087472Z ) -> None: 2025-05-07T20:32:10.7087694Z torch.manual_seed(2025) 2025-05-07T20:32:10.7087950Z 2025-05-07T20:32:10.7088243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7088643Z 2025-05-07T20:32:10.7088850Z > x_sign = torch.sign(x) 2025-05-07T20:32:10.7091170Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
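The requested sizes track the bfloat16 tensors being materialized exactly: a [T, 2*D] bf16 tensor, or any elementwise result of the same shape such as torch.sign(x) above, occupies T * 2D * 2 bytes. A quick cross-check of the figures in this log (the helper is illustrative only):

def bf16_mib(T: int, D: int) -> float:
    # Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
    return T * (2 * D) * 2 / 2**20

assert bf16_mib(2048, 7168) == 56.0    # the randn() OOM further up
assert bf16_mib(2048, 5120) == 40.0    # this torch.sign(x) OOM: the result is x-sized
assert bf16_mib(16384, 7168) == 448.0  # the largest request later in this run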
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

The next ten examples all fail the same way: the first allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) (moe/activation_test.py:92), raises torch.OutOfMemoryError on GPU 0 (22.07 GiB total, 30.44 MiB free, 22.03 GiB in use, 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated), each time with the same hint to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True:

Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 320.00 MiB
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 80.00 MiB
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False  -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True   -> tried to allocate 112.00 MiB
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True   -> tried to allocate 448.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 448.00 MiB
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -> tried to allocate 448.00 MiB

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

This example fits in memory and runs the test body shown above through to the kernel launch:

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the first occurrence above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9605925Z 2025-05-07T20:32:10.9606443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.9606977Z 2025-05-07T20:32:10.9607095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9607531Z self=, 2025-05-07T20:32:10.9607947Z T=2048, 2025-05-07T20:32:10.9608145Z D=7168, 2025-05-07T20:32:10.9608350Z scale_ub=None, 2025-05-07T20:32:10.9608572Z contiguous=False, 2025-05-07T20:32:10.9608812Z compiled=False, 2025-05-07T20:32:10.9609031Z ) 2025-05-07T20:32:10.9609361Z self = 2025-05-07T20:32:10.9609880Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:10.9610164Z 2025-05-07T20:32:10.9610252Z @given( 2025-05-07T20:32:10.9610488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.9610909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.9611232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.9611579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.9611920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.9612222Z ) 2025-05-07T20:32:10.9612589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.9613047Z def test_silu_mul_quant( 2025-05-07T20:32:10.9613754Z self, 2025-05-07T20:32:10.9614136Z T: int, 2025-05-07T20:32:10.9614419Z D: int, 2025-05-07T20:32:10.9614744Z scale_ub: Optional[float], 2025-05-07T20:32:10.9615149Z contiguous: bool, 2025-05-07T20:32:10.9615493Z compiled: bool, 2025-05-07T20:32:10.9615823Z ) -> None: 2025-05-07T20:32:10.9616138Z torch.manual_seed(2025) 2025-05-07T20:32:10.9616497Z 2025-05-07T20:32:10.9616906Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.9619485Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
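Every one of these messages repeats the same hint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. For it to take effect, the variable must be set before the process makes its first CUDA allocation; a sketch of doing so at interpreter startup (where to put it, e.g. a conftest.py or the CI job environment, is an assumption, not something this workflow currently does):

import os

# Must be set before torch initializes CUDA for the allocator to pick it up.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

_ = torch.empty(1, device="cuda")  # first CUDA allocation now uses expandable segments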
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:10.9621413Z 2025-05-07T20:32:10.9621540Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:10.9621768Z 2025-05-07T20:32:10.9621875Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9622310Z self=, 2025-05-07T20:32:10.9622729Z T=128, 2025-05-07T20:32:10.9622925Z D=7168, 2025-05-07T20:32:10.9623131Z scale_ub=1200.0, 2025-05-07T20:32:10.9623360Z contiguous=True, 2025-05-07T20:32:10.9623591Z compiled=True, 2025-05-07T20:32:10.9623804Z ) 2025-05-07T20:32:10.9624131Z self = 2025-05-07T20:32:10.9624645Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:10.9624922Z 2025-05-07T20:32:10.9625013Z @given( 2025-05-07T20:32:10.9625258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.9625580Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.9625901Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.9626245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.9626584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.9626885Z ) 2025-05-07T20:32:10.9627253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.9627715Z def test_silu_mul_quant( 2025-05-07T20:32:10.9628127Z self, 2025-05-07T20:32:10.9628336Z T: int, 2025-05-07T20:32:10.9628536Z D: int, 2025-05-07T20:32:10.9628767Z scale_ub: Optional[float], 2025-05-07T20:32:10.9629056Z contiguous: bool, 2025-05-07T20:32:10.9629302Z compiled: bool, 2025-05-07T20:32:10.9629538Z ) -> None: 2025-05-07T20:32:10.9629763Z torch.manual_seed(2025) 2025-05-07T20:32:10.9630009Z 2025-05-07T20:32:10.9630299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.9630656Z 2025-05-07T20:32:10.9630869Z x_sign = torch.sign(x) 2025-05-07T20:32:10.9631169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.9631495Z x = x_sign * x_clamp 2025-05-07T20:32:10.9631747Z x0 = x[:, :D] 2025-05-07T20:32:10.9631970Z x1 = x[:, D:] 2025-05-07T20:32:10.9632189Z 2025-05-07T20:32:10.9632509Z if contiguous: 2025-05-07T20:32:10.9632745Z x0 = x0.contiguous() 2025-05-07T20:32:10.9633023Z x1 = x1.contiguous() 2025-05-07T20:32:10.9633275Z 2025-05-07T20:32:10.9633469Z if scale_ub is not None: 2025-05-07T20:32:10.9633756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.9634109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.9634425Z ) 2025-05-07T20:32:10.9634628Z else: 2025-05-07T20:32:10.9634846Z scale_ub_tensor = None 2025-05-07T20:32:10.9635104Z 2025-05-07T20:32:10.9635348Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.9635678Z op = silu_mul_quant 2025-05-07T20:32:10.9635940Z if compiled: 2025-05-07T20:32:10.9636194Z op = torch.compile(op) 2025-05-07T20:32:10.9636507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9636798Z 2025-05-07T20:32:10.9636996Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.9637184Z 2025-05-07T20:32:10.9637291Z moe/activation_test.py:117: 2025-05-07T20:32:10.9637604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.9637950Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.9638247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9638831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.9639416Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the first occurrence above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9654161Z 2025-05-07T20:32:10.9654590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.5609964Z 2025-05-07T20:32:11.5610465Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.5610957Z self=, 2025-05-07T20:32:11.5611377Z T=128, 2025-05-07T20:32:11.5611574Z D=7168, 2025-05-07T20:32:11.5611775Z scale_ub=1200.0, 2025-05-07T20:32:11.5612002Z contiguous=True, 2025-05-07T20:32:11.5612236Z compiled=False, 2025-05-07T20:32:11.5612455Z ) 2025-05-07T20:32:11.5612785Z self = 2025-05-07T20:32:11.5613572Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.5613991Z 2025-05-07T20:32:11.5614105Z @given( 2025-05-07T20:32:11.5614355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5614674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.5614996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.5615338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.5615674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.5615972Z ) 2025-05-07T20:32:11.5616338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.5616792Z def test_silu_mul_quant( 2025-05-07T20:32:11.5617045Z self, 2025-05-07T20:32:11.5617245Z T: int, 2025-05-07T20:32:11.5617443Z D: int, 2025-05-07T20:32:11.5617673Z scale_ub: Optional[float], 2025-05-07T20:32:11.5617959Z contiguous: bool, 2025-05-07T20:32:11.5618210Z compiled: bool, 2025-05-07T20:32:11.5618452Z ) -> None: 2025-05-07T20:32:11.5618718Z torch.manual_seed(2025) 2025-05-07T20:32:11.5618978Z 2025-05-07T20:32:11.5619259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5619614Z 2025-05-07T20:32:11.5619818Z x_sign = torch.sign(x) 2025-05-07T20:32:11.5620118Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.5622204Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
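Note that the free-memory figure has dropped from 30.44 MiB in the earlier examples to 8.44 MiB here: allocations made by previous Hypothesis examples are still live when the next example starts. A hypothetical per-example cleanup (activation_test.py does not currently do this) that would return cached blocks between examples:

import gc
import unittest

import torch

class ActivationTests(unittest.TestCase):
    def tearDown(self) -> None:
        # Drop references left over from the previous example, then release
        # PyTorch's cached CUDA blocks so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()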
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5624150Z 2025-05-07T20:32:11.5624272Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:11.5624500Z 2025-05-07T20:32:11.5624883Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.5625321Z self=, 2025-05-07T20:32:11.5625730Z T=128, 2025-05-07T20:32:11.5625925Z D=5120, 2025-05-07T20:32:11.5626126Z scale_ub=1200.0, 2025-05-07T20:32:11.5626349Z contiguous=True, 2025-05-07T20:32:11.5626575Z compiled=True, 2025-05-07T20:32:11.5626786Z ) 2025-05-07T20:32:11.5627129Z self = 2025-05-07T20:32:11.5627632Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.5627914Z 2025-05-07T20:32:11.5627993Z @given( 2025-05-07T20:32:11.5628229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5628552Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.5628864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.5629372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.5629724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.5630015Z ) 2025-05-07T20:32:11.5630375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.5630831Z def test_silu_mul_quant( 2025-05-07T20:32:11.5631074Z self, 2025-05-07T20:32:11.5631276Z T: int, 2025-05-07T20:32:11.5631482Z D: int, 2025-05-07T20:32:11.5631703Z scale_ub: Optional[float], 2025-05-07T20:32:11.5631985Z contiguous: bool, 2025-05-07T20:32:11.5632236Z compiled: bool, 2025-05-07T20:32:11.5632460Z ) -> None: 2025-05-07T20:32:11.5632687Z torch.manual_seed(2025) 2025-05-07T20:32:11.5632939Z 2025-05-07T20:32:11.5633212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5633570Z 2025-05-07T20:32:11.5633774Z > x_sign = torch.sign(x) 2025-05-07T20:32:11.5635794Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5637711Z 2025-05-07T20:32:11.5637839Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.5638058Z 2025-05-07T20:32:11.5638166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.5638596Z self=, 2025-05-07T20:32:11.5639014Z T=128, 2025-05-07T20:32:11.5639203Z D=7168, 2025-05-07T20:32:11.5639401Z scale_ub=None, 2025-05-07T20:32:11.5639629Z contiguous=True, 2025-05-07T20:32:11.5639861Z compiled=True, 2025-05-07T20:32:11.5640067Z ) 2025-05-07T20:32:11.5640513Z self = 2025-05-07T20:32:11.5641025Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.5641297Z 2025-05-07T20:32:11.5641376Z @given( 2025-05-07T20:32:11.5641617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5641944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.5642257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.5642597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.5642936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.5643225Z ) 2025-05-07T20:32:11.5643585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.5644043Z def test_silu_mul_quant( 2025-05-07T20:32:11.5644292Z self, 2025-05-07T20:32:11.5644493Z T: int, 2025-05-07T20:32:11.5644694Z D: int, 2025-05-07T20:32:11.5645009Z scale_ub: Optional[float], 2025-05-07T20:32:11.5645305Z contiguous: bool, 2025-05-07T20:32:11.5645565Z compiled: bool, 2025-05-07T20:32:11.5645805Z ) -> None: 2025-05-07T20:32:11.5646030Z torch.manual_seed(2025) 2025-05-07T20:32:11.5646297Z 2025-05-07T20:32:11.5646593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5649207Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5651683Z 2025-05-07T20:32:11.5651818Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.5652061Z 2025-05-07T20:32:11.5701633Z FAILED 2025-05-07T20:32:11.5701836Z 2025-05-07T20:32:11.5702208Z =================================== FAILURES =================================== 2025-05-07T20:32:11.5702684Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:11.5703333Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:11.5704222Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:11.5705024Z | yield 2025-05-07T20:32:11.5705648Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:11.5706395Z | self._callTestMethod(testMethod) 2025-05-07T20:32:11.5706796Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:11.5707607Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:11.5708398Z | if method() is not None: 2025-05-07T20:32:11.5708752Z | ~~~~~~^^ 2025-05-07T20:32:11.5709682Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:11.5710730Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5711152Z | ^^^^^^^ 2025-05-07T20:32:11.5711958Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:11.5712867Z | raise the_error_hypothesis_found 2025-05-07T20:32:11.5713722Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:11.5714327Z +-+---------------- 1 ---------------- 2025-05-07T20:32:11.5714760Z | Traceback (most recent call last): 2025-05-07T20:32:11.5715790Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:11.5716918Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5719862Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5722732Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:11.5723386Z | self=, 2025-05-07T20:32:11.5723814Z | T=128, 2025-05-07T20:32:11.5724026Z | D=7168, 2025-05-07T20:32:11.5724250Z | scale_ub=1200.0, 2025-05-07T20:32:11.5724503Z | contiguous=True, 2025-05-07T20:32:11.5724760Z | compiled=False, 2025-05-07T20:32:11.5725001Z | ) 2025-05-07T20:32:11.5725186Z | 2025-05-07T20:32:11.5725739Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:11.5726368Z +---------------- 2 ---------------- 2025-05-07T20:32:11.5726666Z | Traceback (most recent call last): 2025-05-07T20:32:11.5727411Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:11.5728216Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5730502Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5732526Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:11.5732980Z | self=, 2025-05-07T20:32:11.5733403Z | T=128, 2025-05-07T20:32:11.5733613Z | D=7168, 2025-05-07T20:32:11.5733825Z | scale_ub=None, 2025-05-07T20:32:11.5734076Z | contiguous=True, 2025-05-07T20:32:11.5734333Z | compiled=True, 2025-05-07T20:32:11.5734564Z | ) 2025-05-07T20:32:11.5734752Z | 2025-05-07T20:32:11.5735297Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:11.5735920Z +---------------- 3 ---------------- 2025-05-07T20:32:11.5736260Z | Traceback (most recent call last): 2025-05-07T20:32:11.5737397Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:11.5738552Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5756667Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
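Each falsifying example above comes with a replay decorator. A sketch of re-running just the first one locally; the blob is version-locked, so it only decodes under hypothesis 6.131.14, and the stub body stands in for the real test method:

from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...  # body identical to ActivationTests.test_silu_mul_quant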
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5759575Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:11.5760337Z | self=, 2025-05-07T20:32:11.5760929Z | T=128, 2025-05-07T20:32:11.5761221Z | D=5120, 2025-05-07T20:32:11.5761519Z | scale_ub=1200.0, 2025-05-07T20:32:11.5761859Z | contiguous=True, 2025-05-07T20:32:11.5762209Z | compiled=True, 2025-05-07T20:32:11.5762540Z | ) 2025-05-07T20:32:11.5762790Z | 2025-05-07T20:32:11.5763547Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:11.5764437Z +---------------- 4 ---------------- 2025-05-07T20:32:11.5764932Z | Traceback (most recent call last): 2025-05-07T20:32:11.5765986Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:11.5767033Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:11.5767451Z | ~~~~~~^^ 2025-05-07T20:32:11.5768381Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:11.5769448Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.5770605Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:11.5771431Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.5771725Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:11.5772140Z | a, 2025-05-07T20:32:11.5772356Z | ^^ 2025-05-07T20:32:11.5772566Z | ...<23 lines>... 
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
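Three of the four distinct failures above are allocator OOMs raised while materializing the [T, 2 * D] bfloat16 input on a 22 GiB card, with nearly all of the 22.05 GiB already held by earlier examples in the same process. The report's own hint is the usual first mitigation. A minimal sketch of applying it, assuming the variable can be put in place before the test process first touches CUDA (the allocator reads it once, at initialization, so exporting it in the workflow step environment is the more reliable route than test code):

    import os

    # Must be set before torch initializes CUDA; a value set after the
    # first allocation is silently ignored.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the env var is in place

    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Expandable segments only addresses fragmentation; releasing test tensors between Hypothesis examples (del followed by torch.cuda.empty_cache()) would attack the accumulation itself.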
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f17492f20>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
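The remaining failure mode is not memory but architecture: Triton cannot lower the fp8e4nv (e4m3) element type on this runner's GPU. The linux.g5.4xlarge runner carries an A10G at compute capability 8.6, while Triton's e4m3 support generally begins at SM 8.9 (Ada) and SM 9.0 (Hopper), which is why only 'fp8e4b15' and 'fp8e5' are offered here. A hedged sketch of a capability guard such a test could use; supports_fp8_e4m3 is an illustrative helper, not an existing FBGEMM utility:

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # fp8e4nv corresponds to e4m3; Triton accepts it on SM 8.9+.
        # The A10G on this runner reports (8, 6), hence the ValueError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 e4m3 not supported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...

With such a guard the suite would report a skip on sm_86 runners instead of burning Hypothesis examples on a guaranteed compilation failure.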
(Hypothesis went on to try the examples below. For each one the runner re-printed the identical test source and an identical Triton traceback ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Runs with compiled=False fail inside fn(), compiling _fbgemm_silu_mul_quant via silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80; runs with compiled=True get past fn() and fail inside ref_fn(), compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row at triton_gemm/fp8_gemm.py:2370.)

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)   -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True)   -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True)   -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)  -> fn() fails; log cuts off mid-traceback in silu_mul_quant at moe/activation.py:80
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6077554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6077789Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6078142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6078349Z kernel = self.compile( 2025-05-07T20:32:11.6078749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6078937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6079069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6079073Z 2025-05-07T20:32:11.6079287Z self = 2025-05-07T20:32:11.6080101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6080742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143f8220>} 2025-05-07T20:32:11.6081528Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6081728Z context = 2025-05-07T20:32:11.6081733Z 2025-05-07T20:32:11.6081911Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6082186Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6082298Z module_map=module_map) 2025-05-07T20:32:11.6082471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6082574Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6082654Z E ^ 2025-05-07T20:32:11.6083026Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6083035Z 2025-05-07T20:32:11.6083466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6083471Z 2025-05-07T20:32:11.6083588Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6083819Z self=, 2025-05-07T20:32:11.6083899Z T=1, 2025-05-07T20:32:11.6083985Z D=5120, 2025-05-07T20:32:11.6084073Z scale_ub=None, 2025-05-07T20:32:11.6084163Z contiguous=True, 2025-05-07T20:32:11.6084256Z compiled=True, 2025-05-07T20:32:11.6084334Z ) 2025-05-07T20:32:11.6084562Z self = 2025-05-07T20:32:11.6084735Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6084739Z 2025-05-07T20:32:11.6084818Z @given( 2025-05-07T20:32:11.6084947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6085056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6085257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6085387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6085505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6085583Z ) 2025-05-07T20:32:11.6085842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6085938Z def test_silu_mul_quant( 2025-05-07T20:32:11.6086018Z self, 2025-05-07T20:32:11.6086105Z T: int, 2025-05-07T20:32:11.6086186Z D: int, 2025-05-07T20:32:11.6086295Z scale_ub: Optional[float], 2025-05-07T20:32:11.6086388Z contiguous: bool, 2025-05-07T20:32:11.6086477Z compiled: bool, 2025-05-07T20:32:11.6086564Z ) -> None: 2025-05-07T20:32:11.6086665Z torch.manual_seed(2025) 2025-05-07T20:32:11.6086742Z 2025-05-07T20:32:11.6086923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6087077Z 2025-05-07T20:32:11.6087178Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6087316Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6087408Z x = x_sign * x_clamp 2025-05-07T20:32:11.6087491Z x0 = x[:, :D] 2025-05-07T20:32:11.6087581Z x1 = x[:, D:] 2025-05-07T20:32:11.6087658Z 2025-05-07T20:32:11.6087753Z if contiguous: 2025-05-07T20:32:11.6087846Z x0 = x0.contiguous() 2025-05-07T20:32:11.6087950Z x1 = x1.contiguous() 2025-05-07T20:32:11.6088027Z 2025-05-07T20:32:11.6088122Z if scale_ub is not None: 2025-05-07T20:32:11.6088238Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6088378Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6088466Z ) 2025-05-07T20:32:11.6088547Z else: 2025-05-07T20:32:11.6088646Z scale_ub_tensor = None 2025-05-07T20:32:11.6088729Z 2025-05-07T20:32:11.6088868Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6088967Z op = silu_mul_quant 2025-05-07T20:32:11.6089062Z if compiled: 2025-05-07T20:32:11.6089166Z op = torch.compile(op) 2025-05-07T20:32:11.6089276Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6089361Z 2025-05-07T20:32:11.6089459Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6089584Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6089667Z 2025-05-07T20:32:11.6089806Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6089921Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6090026Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6090152Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6090305Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6090382Z 2025-05-07T20:32:11.6090491Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6090495Z 2025-05-07T20:32:11.6090610Z moe/activation_test.py:126: 2025-05-07T20:32:11.6090742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6090852Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6090995Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6091572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6091684Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6092055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6092290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6092675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6093029Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6093426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6093599Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6093953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6094041Z fn() 2025-05-07T20:32:11.6094456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6094542Z self.fn.run( 2025-05-07T20:32:11.6094898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6094996Z kernel = self.compile( 2025-05-07T20:32:11.6095394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6095657Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6095791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6095795Z 2025-05-07T20:32:11.6096019Z self = 2025-05-07T20:32:11.6096826Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6097354Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143fae80>} 2025-05-07T20:32:11.6098122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6098331Z context = 2025-05-07T20:32:11.6098342Z 2025-05-07T20:32:11.6098516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6098791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6098909Z module_map=module_map) 2025-05-07T20:32:11.6099076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6099183Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6099271Z E ^ 2025-05-07T20:32:11.6099638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6099643Z 2025-05-07T20:32:11.6100077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6100087Z 2025-05-07T20:32:11.6100200Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6100436Z self=, 2025-05-07T20:32:11.6100522Z T=2048, 2025-05-07T20:32:11.6100599Z D=5120, 2025-05-07T20:32:11.6100702Z scale_ub=None, 2025-05-07T20:32:11.6100792Z contiguous=True, 2025-05-07T20:32:11.6100888Z compiled=True, 2025-05-07T20:32:11.6100966Z ) 2025-05-07T20:32:11.6101194Z self = 2025-05-07T20:32:11.6101378Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6101382Z 2025-05-07T20:32:11.6101462Z @given( 2025-05-07T20:32:11.6101586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6101697Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6101818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6101951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6102157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6102238Z ) 2025-05-07T20:32:11.6102502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6102600Z def test_silu_mul_quant( 2025-05-07T20:32:11.6102681Z self, 2025-05-07T20:32:11.6102771Z T: int, 2025-05-07T20:32:11.6102851Z D: int, 2025-05-07T20:32:11.6102954Z scale_ub: Optional[float], 2025-05-07T20:32:11.6103055Z contiguous: bool, 2025-05-07T20:32:11.6103146Z compiled: bool, 2025-05-07T20:32:11.6103229Z ) -> None: 2025-05-07T20:32:11.6103336Z torch.manual_seed(2025) 2025-05-07T20:32:11.6103419Z 2025-05-07T20:32:11.6103594Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6103679Z 2025-05-07T20:32:11.6103865Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6104081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6104287Z x = x_sign * x_clamp 2025-05-07T20:32:11.6104592Z x0 = x[:, :D] 2025-05-07T20:32:11.6104720Z x1 = x[:, D:] 2025-05-07T20:32:11.6104858Z 2025-05-07T20:32:11.6104999Z if contiguous: 2025-05-07T20:32:11.6112837Z x0 = x0.contiguous() 2025-05-07T20:32:11.6112967Z x1 = x1.contiguous() 2025-05-07T20:32:11.6113050Z 2025-05-07T20:32:11.6113159Z if scale_ub is not None: 2025-05-07T20:32:11.6113282Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6113762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6113901Z ) 2025-05-07T20:32:11.6114004Z else: 2025-05-07T20:32:11.6114107Z scale_ub_tensor = None 2025-05-07T20:32:11.6114195Z 2025-05-07T20:32:11.6114339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6114439Z op = silu_mul_quant 2025-05-07T20:32:11.6114556Z if compiled: 2025-05-07T20:32:11.6114674Z op = torch.compile(op) 2025-05-07T20:32:11.6114797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6114876Z 2025-05-07T20:32:11.6114974Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6115111Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6115190Z 2025-05-07T20:32:11.6115334Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6115452Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6115558Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6115689Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6115845Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6115924Z 2025-05-07T20:32:11.6116040Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6116046Z 2025-05-07T20:32:11.6116152Z moe/activation_test.py:126: 2025-05-07T20:32:11.6116296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6116422Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6116565Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6117152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6117269Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6117648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6117897Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6118283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6118556Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6119203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6119385Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6119753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6119838Z fn() 2025-05-07T20:32:11.6120330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6120429Z self.fn.run( 2025-05-07T20:32:11.6120783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6120884Z kernel = self.compile( 2025-05-07T20:32:11.6121292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6121478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6121757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6121762Z 2025-05-07T20:32:11.6121977Z self = 2025-05-07T20:32:11.6122786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6123319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f0b80>} 2025-05-07T20:32:11.6124088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6124298Z context = 2025-05-07T20:32:11.6124308Z 2025-05-07T20:32:11.6124486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6124761Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6124882Z module_map=module_map) 2025-05-07T20:32:11.6125052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6125168Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6125250Z E ^ 2025-05-07T20:32:11.6125621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6125626Z 2025-05-07T20:32:11.6126064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6126069Z 2025-05-07T20:32:11.6126178Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6126418Z self=, 2025-05-07T20:32:11.6126514Z T=128, 2025-05-07T20:32:11.6126595Z D=5120, 2025-05-07T20:32:11.6126690Z scale_ub=None, 2025-05-07T20:32:11.6126780Z contiguous=True, 2025-05-07T20:32:11.6126867Z compiled=True, 2025-05-07T20:32:11.6126953Z ) 2025-05-07T20:32:11.6127182Z self = 2025-05-07T20:32:11.6127359Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6127364Z 2025-05-07T20:32:11.6127453Z @given( 2025-05-07T20:32:11.6127576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6127687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6127809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6127933Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6128059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6128143Z ) 2025-05-07T20:32:11.6128479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6128591Z def test_silu_mul_quant( 2025-05-07T20:32:11.6128677Z self, 2025-05-07T20:32:11.6128778Z T: int, 2025-05-07T20:32:11.6128870Z D: int, 2025-05-07T20:32:11.6128992Z scale_ub: Optional[float], 2025-05-07T20:32:11.6129086Z contiguous: bool, 2025-05-07T20:32:11.6129183Z compiled: bool, 2025-05-07T20:32:11.6129266Z ) -> None: 2025-05-07T20:32:11.6129372Z torch.manual_seed(2025) 2025-05-07T20:32:11.6129453Z 2025-05-07T20:32:11.6129630Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6129713Z 2025-05-07T20:32:11.6129811Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6129941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6130041Z x = x_sign * x_clamp 2025-05-07T20:32:11.6130126Z x0 = x[:, :D] 2025-05-07T20:32:11.6130287Z x1 = x[:, D:] 2025-05-07T20:32:11.6130376Z 2025-05-07T20:32:11.6130472Z if contiguous: 2025-05-07T20:32:11.6130569Z x0 = x0.contiguous() 2025-05-07T20:32:11.6130671Z x1 = x1.contiguous() 2025-05-07T20:32:11.6130749Z 2025-05-07T20:32:11.6130844Z if scale_ub is not None: 2025-05-07T20:32:11.6130967Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6131109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6131198Z ) 2025-05-07T20:32:11.6131281Z else: 2025-05-07T20:32:11.6131383Z scale_ub_tensor = None 2025-05-07T20:32:11.6131471Z 2025-05-07T20:32:11.6131607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6131702Z op = silu_mul_quant 2025-05-07T20:32:11.6131800Z if compiled: 2025-05-07T20:32:11.6131906Z op = torch.compile(op) 2025-05-07T20:32:11.6132019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6132112Z 2025-05-07T20:32:11.6132213Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6132340Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6132424Z 2025-05-07T20:32:11.6132565Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6132680Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6132785Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6132912Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6133066Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6133147Z 2025-05-07T20:32:11.6133253Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6133257Z 2025-05-07T20:32:11.6133373Z moe/activation_test.py:126: 2025-05-07T20:32:11.6133507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6133627Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6133774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6134358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6134473Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6134855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6135089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6135481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6135751Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6136153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6136332Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6136773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6136863Z fn() 2025-05-07T20:32:11.6137279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6137367Z self.fn.run( 2025-05-07T20:32:11.6137729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6137828Z kernel = self.compile( 2025-05-07T20:32:11.6138230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6138413Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6138546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6138551Z 2025-05-07T20:32:11.6138848Z self = 2025-05-07T20:32:11.6139659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6140190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f1da0>} 2025-05-07T20:32:11.6140964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6141173Z context = 2025-05-07T20:32:11.6141178Z 2025-05-07T20:32:11.6141351Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6141636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6141757Z module_map=module_map) 2025-05-07T20:32:11.6141926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6142033Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6142122Z E ^ 2025-05-07T20:32:11.6142490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6142495Z 2025-05-07T20:32:11.6142935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6142939Z 2025-05-07T20:32:11.6143050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6143283Z self=, 2025-05-07T20:32:11.6143372Z T=4096, 2025-05-07T20:32:11.6143455Z D=5120, 2025-05-07T20:32:11.6143548Z scale_ub=None, 2025-05-07T20:32:11.6143646Z contiguous=True, 2025-05-07T20:32:11.6143740Z compiled=True, 2025-05-07T20:32:11.6143817Z ) 2025-05-07T20:32:11.6144051Z self = 2025-05-07T20:32:11.6144228Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6144232Z 2025-05-07T20:32:11.6144318Z @given( 2025-05-07T20:32:11.6144445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6144550Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6144677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6144799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6144917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6145002Z ) 2025-05-07T20:32:11.6145258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6145370Z def test_silu_mul_quant( 2025-05-07T20:32:11.6145451Z self, 2025-05-07T20:32:11.6145640Z T: int, 2025-05-07T20:32:11.6145731Z D: int, 2025-05-07T20:32:11.6145836Z scale_ub: Optional[float], 2025-05-07T20:32:11.6145932Z contiguous: bool, 2025-05-07T20:32:11.6146032Z compiled: bool, 2025-05-07T20:32:11.6146115Z ) -> None: 2025-05-07T20:32:11.6146214Z torch.manual_seed(2025) 2025-05-07T20:32:11.6146297Z 2025-05-07T20:32:11.6146474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6146553Z 2025-05-07T20:32:11.6146657Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6146788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6146890Z x = x_sign * x_clamp 2025-05-07T20:32:11.6146974Z x0 = x[:, :D] 2025-05-07T20:32:11.6147060Z x1 = x[:, D:] 2025-05-07T20:32:11.6147147Z 2025-05-07T20:32:11.6147235Z if contiguous: 2025-05-07T20:32:11.6147411Z x0 = x0.contiguous() 2025-05-07T20:32:11.6147515Z x1 = x1.contiguous() 2025-05-07T20:32:11.6147598Z 2025-05-07T20:32:11.6147693Z if scale_ub is not None: 2025-05-07T20:32:11.6147813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6147954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6148035Z ) 2025-05-07T20:32:11.6148123Z else: 2025-05-07T20:32:11.6148222Z scale_ub_tensor = None 2025-05-07T20:32:11.6148299Z 2025-05-07T20:32:11.6148447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6148560Z op = silu_mul_quant 2025-05-07T20:32:11.6148667Z if compiled: 2025-05-07T20:32:11.6148795Z op = torch.compile(op) 2025-05-07T20:32:11.6148906Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6148990Z 2025-05-07T20:32:11.6149086Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6149215Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6149310Z 2025-05-07T20:32:11.6149460Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6149567Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6149679Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6149808Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6149965Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6150045Z 2025-05-07T20:32:11.6150150Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6150154Z 2025-05-07T20:32:11.6150266Z moe/activation_test.py:126: 2025-05-07T20:32:11.6150401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6150511Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6150658Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6151245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6151365Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6151739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6151974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6152360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6152632Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6153023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6153204Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6153557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6153653Z fn() 2025-05-07T20:32:11.6154157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6154246Z self.fn.run( 2025-05-07T20:32:11.6154605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6154704Z kernel = self.compile( 2025-05-07T20:32:11.6155100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6155292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6155424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6155429Z 2025-05-07T20:32:11.6155649Z self = 2025-05-07T20:32:11.6156454Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6157063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef32bf1a0>} 2025-05-07T20:32:11.6157836Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6158037Z context = 2025-05-07T20:32:11.6158041Z 2025-05-07T20:32:11.6158222Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6158500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6158625Z module_map=module_map) 2025-05-07T20:32:11.6158798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6158905Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6158994Z E ^ 2025-05-07T20:32:11.6159362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6159366Z 2025-05-07T20:32:11.6159796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6159807Z 2025-05-07T20:32:11.6159917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6160231Z self=, 2025-05-07T20:32:11.6160323Z T=16384, 2025-05-07T20:32:11.6160403Z D=5120, 2025-05-07T20:32:11.6160489Z scale_ub=None, 2025-05-07T20:32:11.6160586Z contiguous=True, 2025-05-07T20:32:11.6160673Z compiled=True, 2025-05-07T20:32:11.6160757Z ) 2025-05-07T20:32:11.6160996Z self = 2025-05-07T20:32:11.6161180Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6161184Z 2025-05-07T20:32:11.6161265Z @given( 2025-05-07T20:32:11.6161400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6161504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6161635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6161758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6161877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6161964Z ) 2025-05-07T20:32:11.6162220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6162318Z def test_silu_mul_quant( 2025-05-07T20:32:11.6162404Z self, 2025-05-07T20:32:11.6162486Z T: int, 2025-05-07T20:32:11.6162566Z D: int, 2025-05-07T20:32:11.6162680Z scale_ub: Optional[float], 2025-05-07T20:32:11.6162857Z contiguous: bool, 2025-05-07T20:32:11.6162958Z compiled: bool, 2025-05-07T20:32:11.6163043Z ) -> None: 2025-05-07T20:32:11.6163142Z torch.manual_seed(2025) 2025-05-07T20:32:11.6163226Z 2025-05-07T20:32:11.6163401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6163479Z 2025-05-07T20:32:11.6163582Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6163717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6163814Z x = x_sign * x_clamp 2025-05-07T20:32:11.6163908Z x0 = x[:, :D] 2025-05-07T20:32:11.6163994Z x1 = x[:, D:] 2025-05-07T20:32:11.6164071Z 2025-05-07T20:32:11.6164171Z if contiguous: 2025-05-07T20:32:11.6164271Z x0 = x0.contiguous() 2025-05-07T20:32:11.6164366Z x1 = x1.contiguous() 2025-05-07T20:32:11.6164452Z 2025-05-07T20:32:11.6164626Z if scale_ub is not None: 2025-05-07T20:32:11.6164751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6164893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6164975Z ) 2025-05-07T20:32:11.6165064Z else: 2025-05-07T20:32:11.6165163Z scale_ub_tensor = None 2025-05-07T20:32:11.6165240Z 2025-05-07T20:32:11.6165385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6165481Z op = silu_mul_quant 2025-05-07T20:32:11.6165570Z if compiled: 2025-05-07T20:32:11.6165681Z op = torch.compile(op) 2025-05-07T20:32:11.6165793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6165871Z 2025-05-07T20:32:11.6165972Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6166099Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6166181Z 2025-05-07T20:32:11.6166322Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6166433Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6166547Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6166674Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6166826Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6166903Z 2025-05-07T20:32:11.6167007Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:11.6167011Z 2025-05-07T20:32:11.6167119Z moe/activation_test.py:126: 2025-05-07T20:32:11.6167252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6167362Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6167507Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6168084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6168198Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6168602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6168865Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6169250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6169515Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6169904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6170083Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6170436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6170521Z fn() 2025-05-07T20:32:11.6170936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6171110Z self.fn.run( 2025-05-07T20:32:11.6171466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6171564Z kernel = self.compile( 2025-05-07T20:32:11.6171957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6172147Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6172282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6172287Z 2025-05-07T20:32:11.6172506Z self = 2025-05-07T20:32:11.6173308Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6173942Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef3a90860>} 2025-05-07T20:32:11.6174711Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6174911Z context = 2025-05-07T20:32:11.6174916Z 2025-05-07T20:32:11.6175097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6175373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6175493Z module_map=module_map) 2025-05-07T20:32:11.6175662Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6175775Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6175864Z E ^ 2025-05-07T20:32:11.6176236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6176241Z 2025-05-07T20:32:11.6176671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6176683Z 2025-05-07T20:32:11.6176795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6177026Z self=, 2025-05-07T20:32:11.6177113Z T=1, 2025-05-07T20:32:11.6177193Z D=5120, 2025-05-07T20:32:11.6177284Z scale_ub=1200.0, 2025-05-07T20:32:11.6177380Z contiguous=True, 2025-05-07T20:32:11.6177467Z compiled=True, 2025-05-07T20:32:11.6177544Z ) 2025-05-07T20:32:11.6177778Z self = 2025-05-07T20:32:11.6177950Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6177959Z 2025-05-07T20:32:11.6178042Z @given( 2025-05-07T20:32:11.6178175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6178278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6178405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6178531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6178651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6178736Z ) 2025-05-07T20:32:11.6178992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6179091Z def test_silu_mul_quant( 2025-05-07T20:32:11.6179180Z self, 2025-05-07T20:32:11.6179263Z T: int, 2025-05-07T20:32:11.6179343Z D: int, 2025-05-07T20:32:11.6179450Z scale_ub: Optional[float], 2025-05-07T20:32:11.6179543Z contiguous: bool, 2025-05-07T20:32:11.6179639Z compiled: bool, 2025-05-07T20:32:11.6179728Z ) -> None: 2025-05-07T20:32:11.6179908Z torch.manual_seed(2025) 2025-05-07T20:32:11.6179994Z 2025-05-07T20:32:11.6180169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6180244Z 2025-05-07T20:32:11.6180346Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6180476Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6180568Z x = x_sign * x_clamp 2025-05-07T20:32:11.6180658Z x0 = x[:, :D] 2025-05-07T20:32:11.6180741Z x1 = x[:, D:] 2025-05-07T20:32:11.6180817Z 2025-05-07T20:32:11.6180910Z if contiguous: 2025-05-07T20:32:11.6181005Z x0 = x0.contiguous() 2025-05-07T20:32:11.6181099Z x1 = x1.contiguous() 2025-05-07T20:32:11.6181180Z 2025-05-07T20:32:11.6181273Z if scale_ub is not None: 2025-05-07T20:32:11.6181389Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6181528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6181686Z ) 2025-05-07T20:32:11.6181777Z else: 2025-05-07T20:32:11.6181874Z scale_ub_tensor = None 2025-05-07T20:32:11.6181950Z 2025-05-07T20:32:11.6182090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6182184Z op = silu_mul_quant 2025-05-07T20:32:11.6182276Z if compiled: 2025-05-07T20:32:11.6182387Z op = torch.compile(op) 2025-05-07T20:32:11.6182499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6182575Z 2025-05-07T20:32:11.6182676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6182681Z 2025-05-07T20:32:11.6182781Z moe/activation_test.py:117: 2025-05-07T20:32:11.6182918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6183023Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6183127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6183520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6183623Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6184136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6184245Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6184616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6184855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6185209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6185308Z kernel = self.compile( 2025-05-07T20:32:11.6185709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6185897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6186040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6186044Z 2025-05-07T20:32:11.6186257Z self = 2025-05-07T20:32:11.6187058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6187587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2d12ca0>} 2025-05-07T20:32:11.6188357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6188647Z context = 2025-05-07T20:32:11.6188652Z 2025-05-07T20:32:11.6188827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6189103Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6189221Z module_map=module_map) 2025-05-07T20:32:11.6189389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6189501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6189582Z E ^ 2025-05-07T20:32:11.6189949Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6189953Z 2025-05-07T20:32:11.6190390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6190395Z 2025-05-07T20:32:11.6190503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6190823Z self=, 2025-05-07T20:32:11.6190907Z T=1, 2025-05-07T20:32:11.6190986Z D=5120, 2025-05-07T20:32:11.6191079Z scale_ub=None, 2025-05-07T20:32:11.6191171Z contiguous=False, 2025-05-07T20:32:11.6191261Z compiled=True, 2025-05-07T20:32:11.6191345Z ) 2025-05-07T20:32:11.6191573Z self = 2025-05-07T20:32:11.6191744Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6191748Z 2025-05-07T20:32:11.6191839Z @given( 2025-05-07T20:32:11.6191963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6192067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6192196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6192319Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6192443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6192528Z ) 2025-05-07T20:32:11.6192788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6192892Z def test_silu_mul_quant( 2025-05-07T20:32:11.6192973Z self, 2025-05-07T20:32:11.6193057Z T: int, 2025-05-07T20:32:11.6193144Z D: int, 2025-05-07T20:32:11.6193246Z scale_ub: Optional[float], 2025-05-07T20:32:11.6193340Z contiguous: bool, 2025-05-07T20:32:11.6193437Z compiled: bool, 2025-05-07T20:32:11.6193519Z ) -> None: 2025-05-07T20:32:11.6193618Z torch.manual_seed(2025) 2025-05-07T20:32:11.6193701Z 2025-05-07T20:32:11.6193876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6193959Z 2025-05-07T20:32:11.6194055Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6194185Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6194286Z x = x_sign * x_clamp 2025-05-07T20:32:11.6194377Z x0 = x[:, :D] 2025-05-07T20:32:11.6194465Z x1 = x[:, D:] 2025-05-07T20:32:11.6194548Z 2025-05-07T20:32:11.6194635Z if contiguous: 2025-05-07T20:32:11.6194731Z x0 = x0.contiguous() 2025-05-07T20:32:11.6194830Z x1 = x1.contiguous() 2025-05-07T20:32:11.6194907Z 2025-05-07T20:32:11.6195001Z if scale_ub is not None: 2025-05-07T20:32:11.6195119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6195259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6195342Z ) 2025-05-07T20:32:11.6195421Z else: 2025-05-07T20:32:11.6195519Z scale_ub_tensor = None 2025-05-07T20:32:11.6195600Z 2025-05-07T20:32:11.6195733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6195827Z op = silu_mul_quant 2025-05-07T20:32:11.6195921Z if compiled: 2025-05-07T20:32:11.6196025Z op = torch.compile(op) 2025-05-07T20:32:11.6196140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6196307Z 2025-05-07T20:32:11.6196404Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6196528Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6196611Z 2025-05-07T20:32:11.6196752Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6196864Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6196969Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6197093Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6197244Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6197320Z 2025-05-07T20:32:11.6197425Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6197429Z 2025-05-07T20:32:11.6197537Z moe/activation_test.py:126: 2025-05-07T20:32:11.6197669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6197855Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6198007Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6198588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6198702Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6199074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6199309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6199694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6199960Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6200438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6200623Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6200979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6201065Z fn() 2025-05-07T20:32:11.6201478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6201563Z self.fn.run( 2025-05-07T20:32:11.6201916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6202014Z kernel = self.compile( 2025-05-07T20:32:11.6202411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6202592Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6202724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6202734Z 2025-05-07T20:32:11.6202957Z self = 2025-05-07T20:32:11.6203758Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6204288Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef38c60c0>} 2025-05-07T20:32:11.6205054Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6205253Z context = 2025-05-07T20:32:11.6205264Z 2025-05-07T20:32:11.6205441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6205800Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6205920Z module_map=module_map) 2025-05-07T20:32:11.6206088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6206195Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6206284Z E ^ 2025-05-07T20:32:11.6206651Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6206656Z 2025-05-07T20:32:11.6207091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6207095Z 2025-05-07T20:32:11.6207203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6207434Z self=, 2025-05-07T20:32:11.6207623Z T=1, 2025-05-07T20:32:11.6207704Z D=5120, 2025-05-07T20:32:11.6207798Z scale_ub=None, 2025-05-07T20:32:11.6207895Z contiguous=True, 2025-05-07T20:32:11.6207983Z compiled=False, 2025-05-07T20:32:11.6208060Z ) 2025-05-07T20:32:11.6208293Z self = 2025-05-07T20:32:11.6208462Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6208466Z 2025-05-07T20:32:11.6208553Z @given( 2025-05-07T20:32:11.6208678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6208781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6208907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6209032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6209151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6209237Z ) 2025-05-07T20:32:11.6209494Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6209608Z def test_silu_mul_quant( 2025-05-07T20:32:11.6209694Z self, 2025-05-07T20:32:11.6209775Z T: int, 2025-05-07T20:32:11.6209865Z D: int, 2025-05-07T20:32:11.6209967Z scale_ub: Optional[float], 2025-05-07T20:32:11.6210061Z contiguous: bool, 2025-05-07T20:32:11.6210163Z compiled: bool, 2025-05-07T20:32:11.6210245Z ) -> None: 2025-05-07T20:32:11.6210347Z torch.manual_seed(2025) 2025-05-07T20:32:11.6210430Z 2025-05-07T20:32:11.6210603Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6210682Z 2025-05-07T20:32:11.6210786Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6210918Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6211011Z x = x_sign * x_clamp 2025-05-07T20:32:11.6211105Z x0 = x[:, :D] 2025-05-07T20:32:11.6211190Z x1 = x[:, D:] 2025-05-07T20:32:11.6211273Z 2025-05-07T20:32:11.6211367Z if contiguous: 2025-05-07T20:32:11.6211463Z x0 = x0.contiguous() 2025-05-07T20:32:11.6211570Z x1 = x1.contiguous() 2025-05-07T20:32:11.6211647Z 2025-05-07T20:32:11.6211740Z if scale_ub is not None: 2025-05-07T20:32:11.6211856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6211996Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6212075Z ) 2025-05-07T20:32:11.6212161Z else: 2025-05-07T20:32:11.6212258Z scale_ub_tensor = None 2025-05-07T20:32:11.6212333Z 2025-05-07T20:32:11.6212471Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6212564Z op = silu_mul_quant 2025-05-07T20:32:11.6212657Z if compiled: 2025-05-07T20:32:11.6212760Z op = torch.compile(op) 2025-05-07T20:32:11.6212869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6212950Z 2025-05-07T20:32:11.6213044Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6213054Z 2025-05-07T20:32:11.6213239Z moe/activation_test.py:117: 2025-05-07T20:32:11.6213639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6213795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6213918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6214448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6214552Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6214933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6215168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6215523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6215628Z kernel = self.compile( 2025-05-07T20:32:11.6216230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6216422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6216560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6216565Z 2025-05-07T20:32:11.6216779Z self = 2025-05-07T20:32:11.6217591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6218115Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164f1bc0>} 2025-05-07T20:32:11.6218897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6219104Z context = 2025-05-07T20:32:11.6219109Z 2025-05-07T20:32:11.6219283Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6219568Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6219682Z module_map=module_map) 2025-05-07T20:32:11.6219856Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6219961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6220045Z E ^ 2025-05-07T20:32:11.6220419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6220423Z 2025-05-07T20:32:11.6220857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[... test body identical to the example above; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant, then fails in _fbgemm_silu_mul_quant with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
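[Note: a minimal sketch, not part of this log, of how a test suite could skip these cases up front. It assumes the root cause indicated by the repeated error: Triton's fp8e4nv (FP8 E4M3) is only compiled for SM 8.9+ GPUs (Ada/Hopper), while this runner's g5.4xlarge provides an NVIDIA A10G (SM 8.6), which only exposes fp8e4b15 and fp8e5. The helper name supports_fp8e4nv is hypothetical.]

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton accepts fp8e4nv only for compute capability >= (8, 9); anything
    # older (e.g. an A10G at (8, 6)) raises the ValueError seen in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):
    ...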
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... identical test body and traceback; same CompilationError in _fbgemm_silu_mul_quant ...]
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... identical test body and traceback; same CompilationError ...]
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... identical test body and traceback; same CompilationError ...]
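[Note: a standalone sketch, not from this log, of the Hypothesis pattern driving the "Trying example" lines above; st.sampled_from enumerates fixed choices and @settings caps the number of generated cases (_MAX_SAMPLES in the real test; 16 here is an arbitrary stand-in).]

from typing import Optional

from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def check_parameter_grid(T: int, D: int, scale_ub: Optional[float]) -> None:
    # Each generated combination corresponds to one "Trying example" line above.
    assert T >= 1 and D in (5120, 7168)

# Calling check_parameter_grid() runs up to max_examples sampled combinations.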
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... identical test body and traceback via torch/_dynamo/eval_frame.py:678; same CompilationError ...]
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical test body and traceback via torch/_dynamo/eval_frame.py:678, ending with ...]
2025-05-07T20:32:11.6309450Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6309555Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6309643Z E ^ 2025-05-07T20:32:11.6310010Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6310015Z 2025-05-07T20:32:11.6310451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test body identical to the example above, except that here fn() itself succeeds and the failure moves into the reference path: ...]
2025-05-07T20:32:11.6317004Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6317131Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6317334Z 2025-05-07T20:32:11.6317484Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6317589Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6317700Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6317827Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6317972Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6318054Z 2025-05-07T20:32:11.6318157Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6318162Z 2025-05-07T20:32:11.6318269Z moe/activation_test.py:126: 2025-05-07T20:32:11.6318405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6318513Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6318681Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6319283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6319399Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6319778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6320010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6320481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6320749Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6321140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6321318Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6321670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6321760Z fn() 2025-05-07T20:32:11.6322180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6322265Z self.fn.run( 2025-05-07T20:32:11.6322620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6322718Z kernel = self.compile( 2025-05-07T20:32:11.6323110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6323301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6323433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6323438Z 2025-05-07T20:32:11.6323654Z self = 2025-05-07T20:32:11.6324539Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6325069Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2ab3e20>} 2025-05-07T20:32:11.6325843Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6326043Z context = 2025-05-07T20:32:11.6326047Z 2025-05-07T20:32:11.6326224Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6326500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6326611Z module_map=module_map) 2025-05-07T20:32:11.6326868Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6326974Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6327062Z E ^ 2025-05-07T20:32:11.6327430Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6327434Z 2025-05-07T20:32:11.6327862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical test body; > y_fp8, y_scale = fn() fails again inside _fbgemm_silu_mul_quant with the same CompilationError ...]
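[Note: a minimal sketch, not from this log, of what the test's ref_fn computes, written in plain PyTorch so it runs on any architecture. FP8_MAX and the exact scale_ub semantics are assumptions about triton_quantize_fp8_row, not FBGEMM's implementation.]

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, exactly as ref_fn above builds y.
    x0_fp32 = x0.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)
    # Row-wise quantization: one dequantization scale per row, so that
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None], as the test checks.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed cap semantics
    row_max = torch.clamp(row_max, min=1e-12)  # avoid division by zero
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale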
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[... identical test body and traceback; same CompilationError ...]
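[Note: a small sketch, not from this log, of the eager-vs-compiled toggle the test uses; it shows why the compiled=True failures above carry torch/_dynamo/eval_frame.py:678 frames before the Triton launch while the compiled=False ones do not.]

import torch

def run_op(op, *args, compiled: bool = False):
    if compiled:
        # torch.compile wraps op, so the call enters through torch._dynamo's
        # eval_frame hook before reaching the underlying Triton kernel launch.
        op = torch.compile(op)
    return op(*args)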
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical test body and traceback via torch/_dynamo/eval_frame.py:678, ending with ...]
2025-05-07T20:32:11.6367766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6367869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6367949Z E ^ 2025-05-07T20:32:11.6368324Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.6368757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried the following examples; each ran the identical test body shown above and failed with the same CompilationError raised from triton/compiler/compiler.py:100:

Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
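Root cause, for reference: fp8e4nv is Triton's name for the FP8 E4M3 format, which NVIDIA GPUs expose only from compute capability 8.9 (Ada, Hopper) onward. The linux.g5.4xlarge runner carries an A10G (SM86), where Triton's CUDA backend offers only fp8e4b15 and fp8e5, exactly the two types listed in the ValueError, so every example aborts in make_ir before the kernel can run. The snippet below is a minimal sketch of the same failure, not code from this repository; it assumes nothing beyond the public torch and triton APIs:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # The cast to tl.float8e4nv is what trips the architecture check;
    # on an SM86 device, compiling this kernel raises the same
    # CompilationError seen throughout this log.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)  # fails at compile time on pre-SM89 GPUs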
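A guard in the same spirit would let the suite skip cleanly on this hardware instead of failing example after example: gate the FP8 tests on the device's compute capability. This is a hedged sketch against the standard torch API only; the helper name and decorator placement are illustrative, not the actual structure of moe/activation_test.py:

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (FP8 E4M3) needs an NVIDIA GPU with
    # compute capability >= (8, 9); the A10G here reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Illustrative placement on the failing test:
# @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM89+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...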
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6522032Z 2025-05-07T20:32:11.6522466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6522470Z 2025-05-07T20:32:11.6522593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6522829Z self=, 2025-05-07T20:32:11.6522920Z T=1, 2025-05-07T20:32:11.6523002Z D=7168, 2025-05-07T20:32:11.6523090Z scale_ub=None, 2025-05-07T20:32:11.6523191Z contiguous=False, 2025-05-07T20:32:11.6523282Z compiled=False, 2025-05-07T20:32:11.6523362Z ) 2025-05-07T20:32:11.6523598Z self = 2025-05-07T20:32:11.6523775Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.6523786Z 2025-05-07T20:32:11.6523869Z @given( 2025-05-07T20:32:11.6524007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6524113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6524242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6524368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6524489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6524579Z ) 2025-05-07T20:32:11.6524835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6524936Z def test_silu_mul_quant( 2025-05-07T20:32:11.6525028Z self, 2025-05-07T20:32:11.6525113Z T: int, 2025-05-07T20:32:11.6525194Z D: int, 2025-05-07T20:32:11.6525308Z scale_ub: Optional[float], 2025-05-07T20:32:11.6525403Z contiguous: bool, 2025-05-07T20:32:11.6525496Z compiled: bool, 2025-05-07T20:32:11.6525591Z ) -> None: 2025-05-07T20:32:11.6525699Z torch.manual_seed(2025) 2025-05-07T20:32:11.6525786Z 2025-05-07T20:32:11.6525972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6526049Z 2025-05-07T20:32:11.6526153Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6526284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6526378Z x = x_sign * x_clamp 2025-05-07T20:32:11.6526474Z x0 = x[:, :D] 2025-05-07T20:32:11.6526559Z x1 = x[:, D:] 2025-05-07T20:32:11.6526635Z 2025-05-07T20:32:11.6526733Z if contiguous: 2025-05-07T20:32:11.6526829Z x0 = x0.contiguous() 2025-05-07T20:32:11.6526927Z x1 = x1.contiguous() 2025-05-07T20:32:11.6527013Z 2025-05-07T20:32:11.6527107Z if scale_ub is not None: 2025-05-07T20:32:11.6527220Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6527373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6527458Z ) 2025-05-07T20:32:11.6527553Z else: 2025-05-07T20:32:11.6527733Z scale_ub_tensor = None 2025-05-07T20:32:11.6527811Z 2025-05-07T20:32:11.6527956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6528054Z op = silu_mul_quant 2025-05-07T20:32:11.6528143Z if compiled: 2025-05-07T20:32:11.6528264Z op = torch.compile(op) 2025-05-07T20:32:11.6528375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6528453Z 2025-05-07T20:32:11.6528559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6528564Z 2025-05-07T20:32:11.6528665Z moe/activation_test.py:117: 2025-05-07T20:32:11.6528806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6528916Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6529025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6529554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6529739Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6530112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6530356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6530711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6530818Z kernel = self.compile( 2025-05-07T20:32:11.6531218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6531409Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6531924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6531928Z 2025-05-07T20:32:11.6532143Z self = 2025-05-07T20:32:11.6532970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6533495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef19909a0>} 2025-05-07T20:32:11.6534266Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6534474Z context = 2025-05-07T20:32:11.6534479Z 2025-05-07T20:32:11.6534655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6534936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6535058Z module_map=module_map) 2025-05-07T20:32:11.6535227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6535341Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6535424Z E ^ 2025-05-07T20:32:11.6535801Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6535806Z 2025-05-07T20:32:11.6536237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6536242Z 2025-05-07T20:32:11.6536351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6536591Z self=, 2025-05-07T20:32:11.6536673Z T=2048, 2025-05-07T20:32:11.6536754Z D=7168, 2025-05-07T20:32:11.6536850Z scale_ub=None, 2025-05-07T20:32:11.6536949Z contiguous=False, 2025-05-07T20:32:11.6537047Z compiled=True, 2025-05-07T20:32:11.6537209Z ) 2025-05-07T20:32:11.6537440Z self = 2025-05-07T20:32:11.6537632Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6537636Z 2025-05-07T20:32:11.6537718Z @given( 2025-05-07T20:32:11.6537844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6537960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6538083Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6538209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6538338Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6538417Z ) 2025-05-07T20:32:11.6538684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6538784Z def test_silu_mul_quant( 2025-05-07T20:32:11.6538865Z self, 2025-05-07T20:32:11.6539030Z T: int, 2025-05-07T20:32:11.6539112Z D: int, 2025-05-07T20:32:11.6539219Z scale_ub: Optional[float], 2025-05-07T20:32:11.6539323Z contiguous: bool, 2025-05-07T20:32:11.6539414Z compiled: bool, 2025-05-07T20:32:11.6539497Z ) -> None: 2025-05-07T20:32:11.6539607Z torch.manual_seed(2025) 2025-05-07T20:32:11.6539685Z 2025-05-07T20:32:11.6539865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6539951Z 2025-05-07T20:32:11.6540049Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6540189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6540284Z x = x_sign * x_clamp 2025-05-07T20:32:11.6540369Z x0 = x[:, :D] 2025-05-07T20:32:11.6540463Z x1 = x[:, D:] 2025-05-07T20:32:11.6540541Z 2025-05-07T20:32:11.6540632Z if contiguous: 2025-05-07T20:32:11.6540737Z x0 = x0.contiguous() 2025-05-07T20:32:11.6540837Z x1 = x1.contiguous() 2025-05-07T20:32:11.6540915Z 2025-05-07T20:32:11.6541025Z if scale_ub is not None: 2025-05-07T20:32:11.6541137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6541284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6541374Z ) 2025-05-07T20:32:11.6541456Z else: 2025-05-07T20:32:11.6541558Z scale_ub_tensor = None 2025-05-07T20:32:11.6541645Z 2025-05-07T20:32:11.6541779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6541882Z op = silu_mul_quant 2025-05-07T20:32:11.6541971Z if compiled: 2025-05-07T20:32:11.6542076Z op = torch.compile(op) 2025-05-07T20:32:11.6542194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6542271Z 2025-05-07T20:32:11.6542366Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6542370Z 2025-05-07T20:32:11.6542480Z moe/activation_test.py:117: 2025-05-07T20:32:11.6542621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6542731Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6542843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6543227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6543333Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6543847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6543952Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6544335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6544569Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6544934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6545041Z kernel = self.compile( 2025-05-07T20:32:11.6545524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6545719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6545855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6545860Z 2025-05-07T20:32:11.6546077Z self = 2025-05-07T20:32:11.6546891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6547419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1992160>} 2025-05-07T20:32:11.6548272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6548474Z context = 2025-05-07T20:32:11.6548478Z 2025-05-07T20:32:11.6548658Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6548936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6549050Z module_map=module_map) 2025-05-07T20:32:11.6549225Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6549328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6549409Z E ^ 2025-05-07T20:32:11.6549785Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6549796Z 2025-05-07T20:32:11.6550233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6550238Z 2025-05-07T20:32:11.6550352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6550586Z self=, 2025-05-07T20:32:11.6550667Z T=4096, 2025-05-07T20:32:11.6550759Z D=7168, 2025-05-07T20:32:11.6550845Z scale_ub=None, 2025-05-07T20:32:11.6550937Z contiguous=False, 2025-05-07T20:32:11.6551033Z compiled=True, 2025-05-07T20:32:11.6551111Z ) 2025-05-07T20:32:11.6551347Z self = 2025-05-07T20:32:11.6551532Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6551536Z 2025-05-07T20:32:11.6551616Z @given( 2025-05-07T20:32:11.6551753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6551870Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6551995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6552127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6552247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6552327Z ) 2025-05-07T20:32:11.6552591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6552692Z def test_silu_mul_quant( 2025-05-07T20:32:11.6552781Z self, 2025-05-07T20:32:11.6552862Z T: int, 2025-05-07T20:32:11.6552944Z D: int, 2025-05-07T20:32:11.6553056Z scale_ub: Optional[float], 2025-05-07T20:32:11.6553150Z contiguous: bool, 2025-05-07T20:32:11.6553243Z compiled: bool, 2025-05-07T20:32:11.6553334Z ) -> None: 2025-05-07T20:32:11.6553435Z torch.manual_seed(2025) 2025-05-07T20:32:11.6553513Z 2025-05-07T20:32:11.6553698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6553784Z 2025-05-07T20:32:11.6553997Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6554138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6554234Z x = x_sign * x_clamp 2025-05-07T20:32:11.6554332Z x0 = x[:, :D] 2025-05-07T20:32:11.6554417Z x1 = x[:, D:] 2025-05-07T20:32:11.6554494Z 2025-05-07T20:32:11.6554590Z if contiguous: 2025-05-07T20:32:11.6554688Z x0 = x0.contiguous() 2025-05-07T20:32:11.6554783Z x1 = x1.contiguous() 2025-05-07T20:32:11.6554871Z 2025-05-07T20:32:11.6554966Z if scale_ub is not None: 2025-05-07T20:32:11.6555081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6555232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6555312Z ) 2025-05-07T20:32:11.6555393Z else: 2025-05-07T20:32:11.6555500Z scale_ub_tensor = None 2025-05-07T20:32:11.6555577Z 2025-05-07T20:32:11.6555792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6555902Z op = silu_mul_quant 2025-05-07T20:32:11.6555991Z if compiled: 2025-05-07T20:32:11.6556105Z op = torch.compile(op) 2025-05-07T20:32:11.6556216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6556293Z 2025-05-07T20:32:11.6556398Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6556403Z 2025-05-07T20:32:11.6556506Z moe/activation_test.py:117: 2025-05-07T20:32:11.6556644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6556759Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6556865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6557254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6557354Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6557871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6557990Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6558363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6558600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6558963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6559063Z kernel = self.compile( 2025-05-07T20:32:11.6559467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6559655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6559792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6559797Z 2025-05-07T20:32:11.6560021Z self = 2025-05-07T20:32:11.6560954Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6561487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1992e80>} 2025-05-07T20:32:11.6562259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6562460Z context = 2025-05-07T20:32:11.6562470Z 2025-05-07T20:32:11.6562643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6563009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6563136Z module_map=module_map) 2025-05-07T20:32:11.6563304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6563409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6563496Z E ^ 2025-05-07T20:32:11.6563867Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6563872Z 2025-05-07T20:32:11.6564315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6564319Z 2025-05-07T20:32:11.6564428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6564661Z self=, 2025-05-07T20:32:11.6564749Z T=16384, 2025-05-07T20:32:11.6564829Z D=5120, 2025-05-07T20:32:11.6564995Z scale_ub=1200.0, 2025-05-07T20:32:11.6565097Z contiguous=False, 2025-05-07T20:32:11.6565189Z compiled=False, 2025-05-07T20:32:11.6565268Z ) 2025-05-07T20:32:11.6565504Z self = 2025-05-07T20:32:11.6565695Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:11.6565699Z 2025-05-07T20:32:11.6565786Z @given( 2025-05-07T20:32:11.6565914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6566019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6566146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6566269Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6566388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6566473Z ) 2025-05-07T20:32:11.6566730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6566834Z def test_silu_mul_quant( 2025-05-07T20:32:11.6566923Z self, 2025-05-07T20:32:11.6567009Z T: int, 2025-05-07T20:32:11.6567098Z D: int, 2025-05-07T20:32:11.6567205Z scale_ub: Optional[float], 2025-05-07T20:32:11.6567300Z contiguous: bool, 2025-05-07T20:32:11.6567402Z compiled: bool, 2025-05-07T20:32:11.6567485Z ) -> None: 2025-05-07T20:32:11.6567587Z torch.manual_seed(2025) 2025-05-07T20:32:11.6567671Z 2025-05-07T20:32:11.6567847Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6567924Z 2025-05-07T20:32:11.6568032Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6568164Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6568260Z x = x_sign * x_clamp 2025-05-07T20:32:11.6568357Z x0 = x[:, :D] 2025-05-07T20:32:11.6568442Z x1 = x[:, D:] 2025-05-07T20:32:11.6568532Z 2025-05-07T20:32:11.6568637Z if contiguous: 2025-05-07T20:32:11.6568749Z x0 = x0.contiguous() 2025-05-07T20:32:11.6568879Z x1 = x1.contiguous() 2025-05-07T20:32:11.6568957Z 2025-05-07T20:32:11.6569053Z if scale_ub is not None: 2025-05-07T20:32:11.6569172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6569314Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6569395Z ) 2025-05-07T20:32:11.6569486Z else: 2025-05-07T20:32:11.6569585Z scale_ub_tensor = None 2025-05-07T20:32:11.6569662Z 2025-05-07T20:32:11.6569810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6569904Z op = silu_mul_quant 2025-05-07T20:32:11.6569998Z if compiled: 2025-05-07T20:32:11.6570101Z op = torch.compile(op) 2025-05-07T20:32:11.6570210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6570292Z 2025-05-07T20:32:11.6570386Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6570390Z 2025-05-07T20:32:11.6570496Z moe/activation_test.py:117: 2025-05-07T20:32:11.6570724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6570831Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6570940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6571456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:11.6571557Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6571934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6572164Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6572517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6572622Z kernel = self.compile( 2025-05-07T20:32:11.6573022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6573297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6573429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6573433Z 2025-05-07T20:32:11.6573649Z self = 2025-05-07T20:32:11.6574457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6574982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169c220>} 2025-05-07T20:32:11.6575761Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6575969Z context = 2025-05-07T20:32:11.6575973Z 2025-05-07T20:32:11.6576152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6576432Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6576545Z module_map=module_map) 2025-05-07T20:32:11.6576717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6576823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6576904Z E ^ 2025-05-07T20:32:11.6577279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6577283Z 2025-05-07T20:32:11.6577712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6577722Z 2025-05-07T20:32:11.6577841Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6578073Z self=, 2025-05-07T20:32:11.6578155Z T=16384, 2025-05-07T20:32:11.6578245Z D=5120, 2025-05-07T20:32:11.6578333Z scale_ub=1200.0, 2025-05-07T20:32:11.6578423Z contiguous=True, 2025-05-07T20:32:11.6578518Z compiled=True, 2025-05-07T20:32:11.6578596Z ) 2025-05-07T20:32:11.6578825Z self = 2025-05-07T20:32:11.6579018Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6579022Z 2025-05-07T20:32:11.6579102Z @given( 2025-05-07T20:32:11.6579234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6579338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6579458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6579678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6579800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6579877Z ) 2025-05-07T20:32:11.6580140Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6580238Z def test_silu_mul_quant( 2025-05-07T20:32:11.6580318Z self, 2025-05-07T20:32:11.6580408Z T: int, 2025-05-07T20:32:11.6580487Z D: int, 2025-05-07T20:32:11.6580596Z scale_ub: Optional[float], 2025-05-07T20:32:11.6580694Z contiguous: bool, 2025-05-07T20:32:11.6580787Z compiled: bool, 2025-05-07T20:32:11.6580875Z ) -> None: 2025-05-07T20:32:11.6580975Z torch.manual_seed(2025) 2025-05-07T20:32:11.6581051Z 2025-05-07T20:32:11.6581232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6581310Z 2025-05-07T20:32:11.6581406Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6581652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6581750Z x = x_sign * x_clamp 2025-05-07T20:32:11.6581835Z x0 = x[:, :D] 2025-05-07T20:32:11.6581926Z x1 = x[:, D:] 2025-05-07T20:32:11.6582003Z 2025-05-07T20:32:11.6582098Z if contiguous: 2025-05-07T20:32:11.6582195Z x0 = x0.contiguous() 2025-05-07T20:32:11.6582288Z x1 = x1.contiguous() 2025-05-07T20:32:11.6582369Z 2025-05-07T20:32:11.6582462Z if scale_ub is not None: 2025-05-07T20:32:11.6582572Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6582716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6582795Z ) 2025-05-07T20:32:11.6582875Z else: 2025-05-07T20:32:11.6582978Z scale_ub_tensor = None 2025-05-07T20:32:11.6583057Z 2025-05-07T20:32:11.6583192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6583290Z op = silu_mul_quant 2025-05-07T20:32:11.6583384Z if compiled: 2025-05-07T20:32:11.6583500Z op = torch.compile(op) 2025-05-07T20:32:11.6583609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6583685Z 2025-05-07T20:32:11.6583784Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6583792Z 2025-05-07T20:32:11.6583895Z moe/activation_test.py:117: 2025-05-07T20:32:11.6584026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6584138Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6584241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6584621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6584726Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6585236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6585350Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6585724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6585956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6586317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6586415Z kernel = self.compile( 2025-05-07T20:32:11.6586816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6587000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6587131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6587137Z 2025-05-07T20:32:11.6587355Z self = 2025-05-07T20:32:11.6588248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6588791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169d4e0>} 2025-05-07T20:32:11.6589560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6589761Z context = 2025-05-07T20:32:11.6589766Z 2025-05-07T20:32:11.6589944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6590220Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6590412Z module_map=module_map) 2025-05-07T20:32:11.6590587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6590690Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6590777Z E ^ 2025-05-07T20:32:11.6591145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6591150Z 2025-05-07T20:32:11.6591581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6591595Z 2025-05-07T20:32:11.6591703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6591937Z self=, 2025-05-07T20:32:11.6592024Z T=16384, 2025-05-07T20:32:11.6592105Z D=5120, 2025-05-07T20:32:11.6592196Z scale_ub=None, 2025-05-07T20:32:11.6592297Z contiguous=False, 2025-05-07T20:32:11.6592389Z compiled=True, 2025-05-07T20:32:11.6592468Z ) 2025-05-07T20:32:11.6592707Z self = 2025-05-07T20:32:11.6592892Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6592896Z 2025-05-07T20:32:11.6592982Z @given( 2025-05-07T20:32:11.6593107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6593210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6593336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6593460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6593578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6593663Z ) 2025-05-07T20:32:11.6593920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6594019Z def test_silu_mul_quant( 2025-05-07T20:32:11.6594106Z self, 2025-05-07T20:32:11.6594191Z T: int, 2025-05-07T20:32:11.6594276Z D: int, 2025-05-07T20:32:11.6594390Z scale_ub: Optional[float], 2025-05-07T20:32:11.6594485Z contiguous: bool, 2025-05-07T20:32:11.6594581Z compiled: bool, 2025-05-07T20:32:11.6594664Z ) -> None: 2025-05-07T20:32:11.6594766Z torch.manual_seed(2025) 2025-05-07T20:32:11.6594849Z 2025-05-07T20:32:11.6595023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6595100Z 2025-05-07T20:32:11.6595203Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6595333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6595427Z x = x_sign * x_clamp 2025-05-07T20:32:11.6595516Z x0 = x[:, :D] 2025-05-07T20:32:11.6595601Z x1 = x[:, D:] 2025-05-07T20:32:11.6595679Z 2025-05-07T20:32:11.6595774Z if contiguous: 2025-05-07T20:32:11.6595868Z x0 = x0.contiguous() 2025-05-07T20:32:11.6595970Z x1 = x1.contiguous() 2025-05-07T20:32:11.6596051Z 2025-05-07T20:32:11.6596147Z if scale_ub is not None: 2025-05-07T20:32:11.6596350Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6596492Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6596570Z ) 2025-05-07T20:32:11.6596655Z else: 2025-05-07T20:32:11.6596753Z scale_ub_tensor = None 2025-05-07T20:32:11.6596828Z 2025-05-07T20:32:11.6596975Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6597069Z op = silu_mul_quant 2025-05-07T20:32:11.6597159Z if compiled: 2025-05-07T20:32:11.6597269Z op = torch.compile(op) 2025-05-07T20:32:11.6597377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6597461Z 2025-05-07T20:32:11.6597559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6597563Z 2025-05-07T20:32:11.6597664Z moe/activation_test.py:117: 2025-05-07T20:32:11.6597801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6597990Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6598098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6598489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6598587Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6599098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6599206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6599577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6599815Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6600286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6600389Z kernel = self.compile( 2025-05-07T20:32:11.6600801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6600984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6601122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6601127Z 2025-05-07T20:32:11.6601338Z self = 2025-05-07T20:32:11.6602138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6602667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169e2a0>} 2025-05-07T20:32:11.6603440Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6603651Z context = 2025-05-07T20:32:11.6603655Z 2025-05-07T20:32:11.6603828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6604104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6604223Z module_map=module_map) 2025-05-07T20:32:11.6604391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6604501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6604582Z E ^ 2025-05-07T20:32:11.6604948Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6604953Z 2025-05-07T20:32:11.6605478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6605483Z 2025-05-07T20:32:11.6605592Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6605832Z self=, 2025-05-07T20:32:11.6605912Z T=2048, 2025-05-07T20:32:11.6605994Z D=5120, 2025-05-07T20:32:11.6606085Z scale_ub=None, 2025-05-07T20:32:11.6606176Z contiguous=False, 2025-05-07T20:32:11.6606262Z compiled=True, 2025-05-07T20:32:11.6606344Z ) 2025-05-07T20:32:11.6606571Z self = 2025-05-07T20:32:11.6606753Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6606757Z 2025-05-07T20:32:11.6606841Z @given( 2025-05-07T20:32:11.6606965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6607075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6607276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6607402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6607528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6607607Z ) 2025-05-07T20:32:11.6607864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6607969Z def test_silu_mul_quant( 2025-05-07T20:32:11.6608050Z self, 2025-05-07T20:32:11.6608130Z T: int, 2025-05-07T20:32:11.6608216Z D: int, 2025-05-07T20:32:11.6608317Z scale_ub: Optional[float], 2025-05-07T20:32:11.6608409Z contiguous: bool, 2025-05-07T20:32:11.6608506Z compiled: bool, 2025-05-07T20:32:11.6608589Z ) -> None: 2025-05-07T20:32:11.6608694Z torch.manual_seed(2025) 2025-05-07T20:32:11.6608771Z 2025-05-07T20:32:11.6608950Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6609039Z 2025-05-07T20:32:11.6609135Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6609268Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6609367Z x = x_sign * x_clamp 2025-05-07T20:32:11.6609451Z x0 = x[:, :D] 2025-05-07T20:32:11.6609534Z x1 = x[:, D:] 2025-05-07T20:32:11.6609620Z 2025-05-07T20:32:11.6609710Z if contiguous: 2025-05-07T20:32:11.6609804Z x0 = x0.contiguous() 2025-05-07T20:32:11.6609903Z x1 = x1.contiguous() 2025-05-07T20:32:11.6609980Z 2025-05-07T20:32:11.6610075Z if scale_ub is not None: 2025-05-07T20:32:11.6610193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6610334Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6610421Z ) 2025-05-07T20:32:11.6610504Z else: 2025-05-07T20:32:11.6610605Z scale_ub_tensor = None 2025-05-07T20:32:11.6610691Z 2025-05-07T20:32:11.6610826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6610925Z op = silu_mul_quant 2025-05-07T20:32:11.6611027Z if compiled: 2025-05-07T20:32:11.6611131Z op = torch.compile(op) 2025-05-07T20:32:11.6611240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6611324Z 2025-05-07T20:32:11.6611421Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6611425Z 2025-05-07T20:32:11.6611536Z moe/activation_test.py:117: 2025-05-07T20:32:11.6611671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6611775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6611886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6612269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6612368Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6612887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6613080Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6613776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6614061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6614418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6614522Z kernel = self.compile( 2025-05-07T20:32:11.6614918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6615101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6615242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6615246Z 2025-05-07T20:32:11.6615459Z self = 2025-05-07T20:32:11.6616537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6617061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169f560>} 2025-05-07T20:32:11.6617832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6618032Z context = 2025-05-07T20:32:11.6618037Z 2025-05-07T20:32:11.6618208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6618490Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6618611Z module_map=module_map) 2025-05-07T20:32:11.6618785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6618888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6618968Z E ^ 2025-05-07T20:32:11.6619341Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6619346Z 2025-05-07T20:32:11.6619775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6619779Z 2025-05-07T20:32:11.6619888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6620123Z self=, 2025-05-07T20:32:11.6620204Z T=2048, 2025-05-07T20:32:11.6620290Z D=5120, 2025-05-07T20:32:11.6620377Z scale_ub=1200.0, 2025-05-07T20:32:11.6620475Z contiguous=False, 2025-05-07T20:32:11.6620568Z compiled=True, 2025-05-07T20:32:11.6620652Z ) 2025-05-07T20:32:11.6620880Z self = 2025-05-07T20:32:11.6621070Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.6621074Z 2025-05-07T20:32:11.6621154Z @given( 2025-05-07T20:32:11.6621277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6621386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6621506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6621633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6621751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6621830Z ) 2025-05-07T20:32:11.6622091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6622189Z def test_silu_mul_quant( 2025-05-07T20:32:11.6622274Z self, 2025-05-07T20:32:11.6622360Z T: int, 2025-05-07T20:32:11.6622576Z D: int, 2025-05-07T20:32:11.6622682Z scale_ub: Optional[float], 2025-05-07T20:32:11.6622781Z contiguous: bool, 2025-05-07T20:32:11.6622870Z compiled: bool, 2025-05-07T20:32:11.6622954Z ) -> None: 2025-05-07T20:32:11.6623057Z torch.manual_seed(2025) 2025-05-07T20:32:11.6623134Z 2025-05-07T20:32:11.6623319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6623398Z 2025-05-07T20:32:11.6623495Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6623633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6623725Z x = x_sign * x_clamp 2025-05-07T20:32:11.6623809Z x0 = x[:, :D] 2025-05-07T20:32:11.6623899Z x1 = x[:, D:] 2025-05-07T20:32:11.6623975Z 2025-05-07T20:32:11.6624062Z if contiguous: 2025-05-07T20:32:11.6624166Z x0 = x0.contiguous() 2025-05-07T20:32:11.6624337Z x1 = x1.contiguous() 2025-05-07T20:32:11.6624414Z 2025-05-07T20:32:11.6624524Z if scale_ub is not None: 2025-05-07T20:32:11.6624635Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6624788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6624868Z ) 2025-05-07T20:32:11.6624948Z else: 2025-05-07T20:32:11.6625055Z scale_ub_tensor = None 2025-05-07T20:32:11.6625131Z 2025-05-07T20:32:11.6625268Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6625374Z op = silu_mul_quant 2025-05-07T20:32:11.6625462Z if compiled: 2025-05-07T20:32:11.6625567Z op = torch.compile(op) 2025-05-07T20:32:11.6625687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6625763Z 2025-05-07T20:32:11.6625857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6625862Z 2025-05-07T20:32:11.6625971Z moe/activation_test.py:117: 2025-05-07T20:32:11.6626118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6626232Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6626337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6626719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6626822Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6627339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6627443Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6627814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6628055Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6628408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6628519Z kernel = self.compile( 2025-05-07T20:32:11.6628919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6629102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6629241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6629246Z 2025-05-07T20:32:11.6629459Z self = 2025-05-07T20:32:11.6630265Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6630788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1534c20>} 2025-05-07T20:32:11.6632221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6632438Z context = 2025-05-07T20:32:11.6632443Z 2025-05-07T20:32:11.6632615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6632896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6633009Z module_map=module_map) 2025-05-07T20:32:11.6633178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6633286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6633367Z E ^ 2025-05-07T20:32:11.6633741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6633822Z 2025-05-07T20:32:11.6634261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6634266Z 2025-05-07T20:32:11.6634375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6634612Z self=, 2025-05-07T20:32:11.6634693Z T=4096, 2025-05-07T20:32:11.6634771Z D=5120, 2025-05-07T20:32:11.6634865Z scale_ub=1200.0, 2025-05-07T20:32:11.6634952Z contiguous=True, 2025-05-07T20:32:11.6635045Z compiled=True, 2025-05-07T20:32:11.6635125Z ) 2025-05-07T20:32:11.6635352Z self = 2025-05-07T20:32:11.6635537Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6635541Z 2025-05-07T20:32:11.6635620Z @given( 2025-05-07T20:32:11.6635743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6635860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6635986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6636107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6636233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6636312Z ) 2025-05-07T20:32:11.6636572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6636670Z def test_silu_mul_quant( 2025-05-07T20:32:11.6636750Z self, 2025-05-07T20:32:11.6636836Z T: int, 2025-05-07T20:32:11.6636915Z D: int, 2025-05-07T20:32:11.6637020Z scale_ub: Optional[float], 2025-05-07T20:32:11.6637119Z contiguous: bool, 2025-05-07T20:32:11.6637209Z compiled: bool, 2025-05-07T20:32:11.6637290Z ) -> None: 2025-05-07T20:32:11.6637395Z torch.manual_seed(2025) 2025-05-07T20:32:11.6637475Z 2025-05-07T20:32:11.6637651Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6637740Z 2025-05-07T20:32:11.6637840Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6637981Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6638074Z x = x_sign * x_clamp 2025-05-07T20:32:11.6638157Z x0 = x[:, :D] 2025-05-07T20:32:11.6638245Z x1 = x[:, D:] 2025-05-07T20:32:11.6638322Z 2025-05-07T20:32:11.6638409Z if contiguous: 2025-05-07T20:32:11.6638510Z x0 = x0.contiguous() 2025-05-07T20:32:11.6638604Z x1 = x1.contiguous() 2025-05-07T20:32:11.6638679Z 2025-05-07T20:32:11.6638779Z if scale_ub is not None: 2025-05-07T20:32:11.6638888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6639052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6639130Z ) 2025-05-07T20:32:11.6639218Z else: 2025-05-07T20:32:11.6639317Z scale_ub_tensor = None 2025-05-07T20:32:11.6639403Z 2025-05-07T20:32:11.6639701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6639799Z op = silu_mul_quant 2025-05-07T20:32:11.6639887Z if compiled: 2025-05-07T20:32:11.6639999Z op = torch.compile(op) 2025-05-07T20:32:11.6640225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6640302Z 2025-05-07T20:32:11.6640403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6640408Z 2025-05-07T20:32:11.6640509Z moe/activation_test.py:117: 2025-05-07T20:32:11.6640656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6640765Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6640869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6641259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6641357Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6641875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6642064Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6642437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6642680Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6643036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6643135Z kernel = self.compile( 2025-05-07T20:32:11.6643545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6643731Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6643864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6643876Z 2025-05-07T20:32:11.6644093Z self = 2025-05-07T20:32:11.6644901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6645430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1535a80>} 2025-05-07T20:32:11.6646198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6646405Z context = 2025-05-07T20:32:11.6646409Z 2025-05-07T20:32:11.6646579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6654204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6654354Z module_map=module_map) 2025-05-07T20:32:11.6654530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6654635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6654724Z E ^ 2025-05-07T20:32:11.6655102Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6655107Z 2025-05-07T20:32:11.6655548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6655562Z 2025-05-07T20:32:11.6655672Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6655908Z self=, 2025-05-07T20:32:11.6656000Z T=128, 2025-05-07T20:32:11.6656088Z D=5120, 2025-05-07T20:32:11.6656178Z scale_ub=1200.0, 2025-05-07T20:32:11.6656415Z contiguous=False, 2025-05-07T20:32:11.6656506Z compiled=True, 2025-05-07T20:32:11.6656586Z ) 2025-05-07T20:32:11.6656824Z self = 2025-05-07T20:32:11.6657006Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.6657011Z 2025-05-07T20:32:11.6657100Z @given( 2025-05-07T20:32:11.6657227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6657333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6657463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6657587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6657712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6657801Z ) 2025-05-07T20:32:11.6658059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6658237Z def test_silu_mul_quant( 2025-05-07T20:32:11.6658328Z self, 2025-05-07T20:32:11.6658414Z T: int, 2025-05-07T20:32:11.6658495Z D: int, 2025-05-07T20:32:11.6658607Z scale_ub: Optional[float], 2025-05-07T20:32:11.6658712Z contiguous: bool, 2025-05-07T20:32:11.6658825Z compiled: bool, 2025-05-07T20:32:11.6658925Z ) -> None: 2025-05-07T20:32:11.6659035Z torch.manual_seed(2025) 2025-05-07T20:32:11.6659121Z 2025-05-07T20:32:11.6659299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6659378Z 2025-05-07T20:32:11.6659485Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6659616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6659712Z x = x_sign * x_clamp 2025-05-07T20:32:11.6659810Z x0 = x[:, :D] 2025-05-07T20:32:11.6659895Z x1 = x[:, D:] 2025-05-07T20:32:11.6659974Z 2025-05-07T20:32:11.6660074Z if contiguous: 2025-05-07T20:32:11.6660180Z x0 = x0.contiguous() 2025-05-07T20:32:11.6660289Z x1 = x1.contiguous() 2025-05-07T20:32:11.6660369Z 2025-05-07T20:32:11.6660465Z if scale_ub is not None: 2025-05-07T20:32:11.6660584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6660727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6660808Z ) 2025-05-07T20:32:11.6660897Z else: 2025-05-07T20:32:11.6660995Z scale_ub_tensor = None 2025-05-07T20:32:11.6661072Z 2025-05-07T20:32:11.6661213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6661309Z op = silu_mul_quant 2025-05-07T20:32:11.6661399Z if compiled: 2025-05-07T20:32:11.6661512Z op = torch.compile(op) 2025-05-07T20:32:11.6661623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6661699Z 2025-05-07T20:32:11.6661802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6661812Z 2025-05-07T20:32:11.6661914Z moe/activation_test.py:117: 2025-05-07T20:32:11.6662062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6662167Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6662274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6662671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6662770Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6663285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6663396Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6663769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6664012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6664455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6664556Z kernel = self.compile( 2025-05-07T20:32:11.6664963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6665148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6665287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6665292Z 2025-05-07T20:32:11.6665505Z self = 2025-05-07T20:32:11.6666314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6666855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1536ca0>} 2025-05-07T20:32:11.6667701Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6667903Z context = 2025-05-07T20:32:11.6667908Z 2025-05-07T20:32:11.6668077Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6668349Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6668467Z module_map=module_map) 2025-05-07T20:32:11.6668635Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6668747Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6668828Z E ^ 2025-05-07T20:32:11.6669200Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6669215Z 2025-05-07T20:32:11.6669654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6669659Z 2025-05-07T20:32:11.6669768Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6670007Z self=, 2025-05-07T20:32:11.6670089Z T=16384, 2025-05-07T20:32:11.6670171Z D=7168, 2025-05-07T20:32:11.6670266Z scale_ub=1200.0, 2025-05-07T20:32:11.6670355Z contiguous=True, 2025-05-07T20:32:11.6670443Z compiled=True, 2025-05-07T20:32:11.6670528Z ) 2025-05-07T20:32:11.6670758Z self = 2025-05-07T20:32:11.6670942Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6670947Z 2025-05-07T20:32:11.6671034Z @given( 2025-05-07T20:32:11.6671164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6671286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6671407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6671530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6671658Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6671738Z ) 2025-05-07T20:32:11.6671995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6672103Z def test_silu_mul_quant( 2025-05-07T20:32:11.6672185Z self, 2025-05-07T20:32:11.6672270Z T: int, 2025-05-07T20:32:11.6672359Z D: int, 2025-05-07T20:32:11.6672464Z scale_ub: Optional[float], 2025-05-07T20:32:11.6672558Z contiguous: bool, 2025-05-07T20:32:11.6672661Z compiled: bool, 2025-05-07T20:32:11.6672744Z ) -> None: 2025-05-07T20:32:11.6672857Z torch.manual_seed(2025) 2025-05-07T20:32:11.6672940Z 2025-05-07T20:32:11.6673200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6673286Z 2025-05-07T20:32:11.6673385Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6673520Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6673624Z x = x_sign * x_clamp 2025-05-07T20:32:11.6673709Z x0 = x[:, :D] 2025-05-07T20:32:11.6673795Z x1 = x[:, D:] 2025-05-07T20:32:11.6673882Z 2025-05-07T20:32:11.6673975Z if contiguous: 2025-05-07T20:32:11.6674074Z x0 = x0.contiguous() 2025-05-07T20:32:11.6674178Z x1 = x1.contiguous() 2025-05-07T20:32:11.6674257Z 2025-05-07T20:32:11.6674354Z if scale_ub is not None: 2025-05-07T20:32:11.6674472Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6674615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6674707Z ) 2025-05-07T20:32:11.6674788Z else: 2025-05-07T20:32:11.6674965Z scale_ub_tensor = None 2025-05-07T20:32:11.6675051Z 2025-05-07T20:32:11.6675197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6675292Z op = silu_mul_quant 2025-05-07T20:32:11.6675390Z if compiled: 2025-05-07T20:32:11.6675497Z op = torch.compile(op) 2025-05-07T20:32:11.6675613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6675697Z 2025-05-07T20:32:11.6675793Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6675797Z 2025-05-07T20:32:11.6675907Z moe/activation_test.py:117: 2025-05-07T20:32:11.6676041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6676146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6676258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6676639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6676743Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> CompilationError: type fp8e4nv not supported in this architecture

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError: type fp8e4nv not supported in this architecture

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError: type fp8e4nv not supported in this architecture

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 320.00 MiB; 144.44 MiB free of 22.07 GiB, 21.60 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB; 32.44 MiB free, 21.61 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB; 144.44 MiB free, 21.50 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB; 32.44 MiB free, 21.67 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 32.44 MiB free, 21.67 GiB allocated by PyTorch

Each OutOfMemoryError message ends with the same allocator hint: "If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
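The OutOfMemoryError examples are consistent with the requested shapes rather than with any single bad example: each example allocates a [T, 2 * D] bfloat16 input plus several same-sized temporaries, and by this point roughly 21.5 to 21.7 GiB of the 22.07 GiB device is already held by PyTorch. The reported sizes match the shape arithmetic exactly; below is a worked check for the largest request above, plus a common mitigation between examples (the helper sketches a general technique, not something the test currently does):

    import torch

    # Size check: a [16384, 2 * 7168] bfloat16 tensor takes
    # 16384 * 14336 elements * 2 bytes = 469,762,048 bytes = 448 MiB,
    # matching "Tried to allocate 448.00 MiB" in the log above.
    assert 16384 * (2 * 7168) * 2 == 448 * 2**20

    def release_cached_cuda_memory() -> None:
        # Return freed-but-cached allocator blocks to the driver so one
        # example's peak does not starve the next; the log's own hint,
        # PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets the
        # related fragmentation problem.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()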
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6862744Z 2025-05-07T20:32:11.6862883Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.6863106Z 2025-05-07T20:32:11.6863212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6863645Z self=, 2025-05-07T20:32:11.6864065Z T=1, 2025-05-07T20:32:11.6864260Z D=7168, 2025-05-07T20:32:11.6864457Z scale_ub=1200.0, 2025-05-07T20:32:11.6864691Z contiguous=True, 2025-05-07T20:32:11.6864924Z compiled=False, 2025-05-07T20:32:11.6865135Z ) 2025-05-07T20:32:11.6865469Z self = 2025-05-07T20:32:11.6865980Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.6866262Z 2025-05-07T20:32:11.6866348Z @given( 2025-05-07T20:32:11.6866588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6866917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6867234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6867582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6867930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6868229Z ) 2025-05-07T20:32:11.6868590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6869102Z def test_silu_mul_quant( 2025-05-07T20:32:11.6869363Z self, 2025-05-07T20:32:11.6869565Z T: int, 2025-05-07T20:32:11.6869769Z D: int, 2025-05-07T20:32:11.6869999Z scale_ub: Optional[float], 2025-05-07T20:32:11.6870275Z contiguous: bool, 2025-05-07T20:32:11.6870525Z compiled: bool, 2025-05-07T20:32:11.6870764Z ) -> None: 2025-05-07T20:32:11.6871199Z torch.manual_seed(2025) 2025-05-07T20:32:11.6871458Z 2025-05-07T20:32:11.6871741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6872092Z 2025-05-07T20:32:11.6872298Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6872603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6872921Z x = x_sign * x_clamp 2025-05-07T20:32:11.6873173Z x0 = x[:, :D] 2025-05-07T20:32:11.6873401Z x1 = x[:, D:] 2025-05-07T20:32:11.6873620Z 2025-05-07T20:32:11.6873808Z if contiguous: 2025-05-07T20:32:11.6874052Z x0 = x0.contiguous() 2025-05-07T20:32:11.6874323Z x1 = x1.contiguous() 2025-05-07T20:32:11.6874570Z 2025-05-07T20:32:11.6874771Z if scale_ub is not None: 2025-05-07T20:32:11.6875060Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6875407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6875851Z ) 2025-05-07T20:32:11.6876061Z else: 2025-05-07T20:32:11.6876276Z scale_ub_tensor = None 2025-05-07T20:32:11.6876541Z 2025-05-07T20:32:11.6876784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6877107Z op = silu_mul_quant 2025-05-07T20:32:11.6877372Z if compiled: 2025-05-07T20:32:11.6877634Z op = torch.compile(op) 2025-05-07T20:32:11.6877940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6878234Z 2025-05-07T20:32:11.6878442Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6878614Z 2025-05-07T20:32:11.6878728Z moe/activation_test.py:117: 2025-05-07T20:32:11.6879057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6879430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6879727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6880520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6881250Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6881813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6882527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6883216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6883773Z kernel = self.compile( 2025-05-07T20:32:11.6884341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6885022Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6885440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6885682Z 2025-05-07T20:32:11.6885902Z self = 2025-05-07T20:32:11.6887028Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6888451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c4400>} 2025-05-07T20:32:11.6889842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6890905Z context = 2025-05-07T20:32:11.6891209Z 2025-05-07T20:32:11.6891389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6892030Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6892518Z module_map=module_map) 2025-05-07T20:32:11.6892904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6893276Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6893542Z E ^ 2025-05-07T20:32:11.6894028Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6894497Z 2025-05-07T20:32:11.6894938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6895470Z 2025-05-07T20:32:11.6895585Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6896014Z self=, 2025-05-07T20:32:11.6896434Z T=128, 2025-05-07T20:32:11.6896634Z D=5120, 2025-05-07T20:32:11.6896910Z scale_ub=None, 2025-05-07T20:32:11.6897139Z contiguous=True, 2025-05-07T20:32:11.6897372Z compiled=False, 2025-05-07T20:32:11.6897579Z ) 2025-05-07T20:32:11.6897911Z self = 2025-05-07T20:32:11.6898424Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6898703Z 2025-05-07T20:32:11.6898786Z @given( 2025-05-07T20:32:11.6899021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6899347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6899667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6900006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6900350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6900650Z ) 2025-05-07T20:32:11.6901014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6901482Z def test_silu_mul_quant( 2025-05-07T20:32:11.6901737Z self, 2025-05-07T20:32:11.6901943Z T: int, 2025-05-07T20:32:11.6902152Z D: int, 2025-05-07T20:32:11.6902384Z scale_ub: Optional[float], 2025-05-07T20:32:11.6902663Z contiguous: bool, 2025-05-07T20:32:11.6902916Z compiled: bool, 2025-05-07T20:32:11.6903150Z ) -> None: 2025-05-07T20:32:11.6903373Z torch.manual_seed(2025) 2025-05-07T20:32:11.6903624Z 2025-05-07T20:32:11.6903907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6904264Z 2025-05-07T20:32:11.6904463Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6904772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6905096Z x = x_sign * x_clamp 2025-05-07T20:32:11.6905340Z x0 = x[:, :D] 2025-05-07T20:32:11.6905568Z x1 = x[:, D:] 2025-05-07T20:32:11.6905785Z 2025-05-07T20:32:11.6905973Z if contiguous: 2025-05-07T20:32:11.6906222Z x0 = x0.contiguous() 2025-05-07T20:32:11.6906499Z x1 = x1.contiguous() 2025-05-07T20:32:11.6906745Z 2025-05-07T20:32:11.6906945Z if scale_ub is not None: 2025-05-07T20:32:11.6907235Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6907580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6907905Z ) 2025-05-07T20:32:11.6908109Z else: 2025-05-07T20:32:11.6908324Z scale_ub_tensor = None 2025-05-07T20:32:11.6908587Z 2025-05-07T20:32:11.6908829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6909164Z op = silu_mul_quant 2025-05-07T20:32:11.6909419Z if compiled: 2025-05-07T20:32:11.6909678Z op = torch.compile(op) 2025-05-07T20:32:11.6909988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6910268Z 2025-05-07T20:32:11.6910469Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6910640Z 2025-05-07T20:32:11.6910755Z moe/activation_test.py:117: 2025-05-07T20:32:11.6911149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6911501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6911794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6912516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6913226Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6914133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6914853Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6915542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6916098Z kernel = self.compile( 2025-05-07T20:32:11.6916666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6917585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6917998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6918242Z 2025-05-07T20:32:11.6918459Z self = 2025-05-07T20:32:11.6919585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6921131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c5300>} 2025-05-07T20:32:11.6922529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6923598Z context = 2025-05-07T20:32:11.6923907Z 2025-05-07T20:32:11.6924083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6924630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6925114Z module_map=module_map) 2025-05-07T20:32:11.6925497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6925867Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6926139Z E ^ 2025-05-07T20:32:11.6926622Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6927096Z 2025-05-07T20:32:11.6927529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6928066Z 2025-05-07T20:32:11.6928189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6928612Z self=, 2025-05-07T20:32:11.6929029Z T=128, 2025-05-07T20:32:11.6929223Z D=7168, 2025-05-07T20:32:11.6929420Z scale_ub=None, 2025-05-07T20:32:11.6929639Z contiguous=True, 2025-05-07T20:32:11.6929868Z compiled=False, 2025-05-07T20:32:11.6930076Z ) 2025-05-07T20:32:11.6930409Z self = 2025-05-07T20:32:11.6930918Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6931194Z 2025-05-07T20:32:11.6931271Z @given( 2025-05-07T20:32:11.6931508Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6931833Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6932152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6932495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6932993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6933295Z ) 2025-05-07T20:32:11.6933650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6934106Z def test_silu_mul_quant( 2025-05-07T20:32:11.6934357Z self, 2025-05-07T20:32:11.6934549Z T: int, 2025-05-07T20:32:11.6934749Z D: int, 2025-05-07T20:32:11.6934972Z scale_ub: Optional[float], 2025-05-07T20:32:11.6935249Z contiguous: bool, 2025-05-07T20:32:11.6935494Z compiled: bool, 2025-05-07T20:32:11.6935723Z ) -> None: 2025-05-07T20:32:11.6935937Z torch.manual_seed(2025) 2025-05-07T20:32:11.6936185Z 2025-05-07T20:32:11.6936464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6936817Z 2025-05-07T20:32:11.6937011Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6937404Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6937733Z x = x_sign * x_clamp 2025-05-07T20:32:11.6937977Z x0 = x[:, :D] 2025-05-07T20:32:11.6938201Z x1 = x[:, D:] 2025-05-07T20:32:11.6938412Z 2025-05-07T20:32:11.6938596Z if contiguous: 2025-05-07T20:32:11.6938833Z x0 = x0.contiguous() 2025-05-07T20:32:11.6939100Z x1 = x1.contiguous() 2025-05-07T20:32:11.6939341Z 2025-05-07T20:32:11.6939559Z if scale_ub is not None: 2025-05-07T20:32:11.6939843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6940189Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6940510Z ) 2025-05-07T20:32:11.6940703Z else: 2025-05-07T20:32:11.6940921Z scale_ub_tensor = None 2025-05-07T20:32:11.6941184Z 2025-05-07T20:32:11.6941418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6941743Z op = silu_mul_quant 2025-05-07T20:32:11.6942010Z if compiled: 2025-05-07T20:32:11.6942118Z op = torch.compile(op) 2025-05-07T20:32:11.6942227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6942309Z 2025-05-07T20:32:11.6942406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6942411Z 2025-05-07T20:32:11.6942513Z moe/activation_test.py:117: 2025-05-07T20:32:11.6942651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6942754Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6942864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6943381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6943482Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6943860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6944101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6944457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6944560Z kernel = self.compile( 2025-05-07T20:32:11.6944955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6945142Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6945272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6945277Z 2025-05-07T20:32:11.6945486Z self = 2025-05-07T20:32:11.6946487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6947137Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c60c0>} 2025-05-07T20:32:11.6947965Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6956803Z context = 2025-05-07T20:32:11.6956814Z 2025-05-07T20:32:11.6957002Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6957292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6957409Z module_map=module_map) 2025-05-07T20:32:11.6957582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6957862Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6957947Z E ^ 2025-05-07T20:32:11.6958329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6958334Z 2025-05-07T20:32:11.6958825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6958830Z 2025-05-07T20:32:11.6958947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6959190Z self=, 2025-05-07T20:32:11.6959273Z T=2048, 2025-05-07T20:32:11.6959359Z D=7168, 2025-05-07T20:32:11.6959456Z scale_ub=1200.0, 2025-05-07T20:32:11.6959545Z contiguous=True, 2025-05-07T20:32:11.6959633Z compiled=False, 2025-05-07T20:32:11.6959720Z ) 2025-05-07T20:32:11.6959950Z self = 2025-05-07T20:32:11.6960245Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.6960265Z 2025-05-07T20:32:11.6960353Z @given( 2025-05-07T20:32:11.6960482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6960596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6960718Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6960842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6960972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6961052Z ) 2025-05-07T20:32:11.6961312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6961420Z def test_silu_mul_quant( 2025-05-07T20:32:11.6961502Z self, 2025-05-07T20:32:11.6961584Z T: int, 2025-05-07T20:32:11.6961673Z D: int, 2025-05-07T20:32:11.6961778Z scale_ub: Optional[float], 2025-05-07T20:32:11.6961883Z contiguous: bool, 2025-05-07T20:32:11.6961976Z compiled: bool, 2025-05-07T20:32:11.6962068Z ) -> None: 2025-05-07T20:32:11.6962182Z torch.manual_seed(2025) 2025-05-07T20:32:11.6962261Z 2025-05-07T20:32:11.6962440Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6964309Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
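[annotation] The CompilationError repeated above is an architecture mismatch, not a code bug: Triton's fp8e4nv dtype corresponds to torch.float8_e4m3fn and requires compute capability (8, 9) or newer (Ada/Hopper); a GPU that only exposes fp8e4b15 and fp8e5, as the error message states, reports a lower capability. A minimal guard one could add to the test module (hypothetical; not part of activation_test.py) looks like:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+; GPUs that only expose
        # fp8e4b15/fp8e5, like the one in this run, report below (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)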
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6964315Z 2025-05-07T20:32:11.6964440Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.6964445Z 2025-05-07T20:32:11.6964562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6964894Z self=, 2025-05-07T20:32:11.6964977Z T=1, 2025-05-07T20:32:11.6965064Z D=5120, 2025-05-07T20:32:11.6965152Z scale_ub=1200.0, 2025-05-07T20:32:11.6965246Z contiguous=True, 2025-05-07T20:32:11.6965333Z compiled=False, 2025-05-07T20:32:11.6965409Z ) 2025-05-07T20:32:11.6965642Z self = 2025-05-07T20:32:11.6965815Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.6965819Z 2025-05-07T20:32:11.6965899Z @given( 2025-05-07T20:32:11.6966030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6966132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6966253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6966387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6966505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6966720Z ) 2025-05-07T20:32:11.6966981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6967079Z def test_silu_mul_quant( 2025-05-07T20:32:11.6967166Z self, 2025-05-07T20:32:11.6967246Z T: int, 2025-05-07T20:32:11.6967326Z D: int, 2025-05-07T20:32:11.6967436Z scale_ub: Optional[float], 2025-05-07T20:32:11.6967530Z contiguous: bool, 2025-05-07T20:32:11.6967620Z compiled: bool, 2025-05-07T20:32:11.6967709Z ) -> None: 2025-05-07T20:32:11.6967808Z torch.manual_seed(2025) 2025-05-07T20:32:11.6967884Z 2025-05-07T20:32:11.6968068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6968148Z 2025-05-07T20:32:11.6968252Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6968383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6968474Z x = x_sign * x_clamp 2025-05-07T20:32:11.6968570Z x0 = x[:, :D] 2025-05-07T20:32:11.6968657Z x1 = x[:, D:] 2025-05-07T20:32:11.6968733Z 2025-05-07T20:32:11.6968830Z if contiguous: 2025-05-07T20:32:11.6968925Z x0 = x0.contiguous() 2025-05-07T20:32:11.6969019Z x1 = x1.contiguous() 2025-05-07T20:32:11.6969100Z 2025-05-07T20:32:11.6969194Z if scale_ub is not None: 2025-05-07T20:32:11.6969304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6969453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6969531Z ) 2025-05-07T20:32:11.6969618Z else: 2025-05-07T20:32:11.6969716Z scale_ub_tensor = None 2025-05-07T20:32:11.6969790Z 2025-05-07T20:32:11.6969935Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6970032Z op = silu_mul_quant 2025-05-07T20:32:11.6970122Z if compiled: 2025-05-07T20:32:11.6970231Z op = torch.compile(op) 2025-05-07T20:32:11.6970348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6970428Z 2025-05-07T20:32:11.6970529Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6970533Z 2025-05-07T20:32:11.6970635Z moe/activation_test.py:117: 2025-05-07T20:32:11.6970772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6970885Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6970990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6971517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6971620Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6971996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6972240Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6972689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6972800Z kernel = self.compile( 2025-05-07T20:32:11.6973200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6973383Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6973522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6973526Z 2025-05-07T20:32:11.6973739Z self = 2025-05-07T20:32:11.6974544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6975084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c76a0>} 2025-05-07T20:32:11.6975937Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6976147Z context = 2025-05-07T20:32:11.6976151Z 2025-05-07T20:32:11.6976326Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6976609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6976724Z module_map=module_map) 2025-05-07T20:32:11.6976891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6977003Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6977083Z E ^ 2025-05-07T20:32:11.6977460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6977475Z 2025-05-07T20:32:11.6977915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6977920Z 2025-05-07T20:32:11.6978028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6978269Z self=, 2025-05-07T20:32:11.6978354Z T=2048, 2025-05-07T20:32:11.6978434Z D=5120, 2025-05-07T20:32:11.6978534Z scale_ub=None, 2025-05-07T20:32:11.6978642Z contiguous=True, 2025-05-07T20:32:11.6978739Z compiled=False, 2025-05-07T20:32:11.6978840Z ) 2025-05-07T20:32:11.6979067Z self = 2025-05-07T20:32:11.6979260Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6979264Z 2025-05-07T20:32:11.6979344Z @given( 2025-05-07T20:32:11.6979474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6979592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6979711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6979833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6979958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6980036Z ) 2025-05-07T20:32:11.6980293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6980399Z def test_silu_mul_quant( 2025-05-07T20:32:11.6980479Z self, 2025-05-07T20:32:11.6980565Z T: int, 2025-05-07T20:32:11.6980645Z D: int, 2025-05-07T20:32:11.6980746Z scale_ub: Optional[float], 2025-05-07T20:32:11.6980847Z contiguous: bool, 2025-05-07T20:32:11.6980935Z compiled: bool, 2025-05-07T20:32:11.6981021Z ) -> None: 2025-05-07T20:32:11.6981127Z torch.manual_seed(2025) 2025-05-07T20:32:11.6981208Z 2025-05-07T20:32:11.6981474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6981563Z 2025-05-07T20:32:11.6981659Z > x_sign = torch.sign(x) 2025-05-07T20:32:11.6983519Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6983525Z 2025-05-07T20:32:11.6983648Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.6983653Z 2025-05-07T20:32:11.6983764Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6984093Z self=, 2025-05-07T20:32:11.6984175Z T=16384, 2025-05-07T20:32:11.6984266Z D=5120, 2025-05-07T20:32:11.6984350Z scale_ub=None, 2025-05-07T20:32:11.6984438Z contiguous=True, 2025-05-07T20:32:11.6984532Z compiled=False, 2025-05-07T20:32:11.6984608Z ) 2025-05-07T20:32:11.6984837Z self = 2025-05-07T20:32:11.6985030Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6985034Z 2025-05-07T20:32:11.6985115Z @given( 2025-05-07T20:32:11.6985238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6985350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6985468Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6985595Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6985713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6985796Z ) 2025-05-07T20:32:11.6986062Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6986160Z def test_silu_mul_quant( 2025-05-07T20:32:11.6986238Z self, 2025-05-07T20:32:11.6986324Z T: int, 2025-05-07T20:32:11.6986403Z D: int, 2025-05-07T20:32:11.6986505Z scale_ub: Optional[float], 2025-05-07T20:32:11.6986605Z contiguous: bool, 2025-05-07T20:32:11.6986693Z compiled: bool, 2025-05-07T20:32:11.6986774Z ) -> None: 2025-05-07T20:32:11.6986878Z torch.manual_seed(2025) 2025-05-07T20:32:11.6986955Z 2025-05-07T20:32:11.6987137Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6989024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
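[annotation] The allocator message above already names the relevant knob. For it to take effect, PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation; a sketch of one way to do that (the workflow does not currently set it) is:

    import os
    # Must run before torch initializes the CUDA caching allocator; the same
    # variable can instead be exported in the job environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch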
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6989036Z 2025-05-07T20:32:11.6989165Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.6989170Z 2025-05-07T20:32:11.6989279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6989510Z self=, 2025-05-07T20:32:11.6989598Z T=4096, 2025-05-07T20:32:11.6989679Z D=5120, 2025-05-07T20:32:11.6989766Z scale_ub=None, 2025-05-07T20:32:11.6989864Z contiguous=True, 2025-05-07T20:32:11.6989953Z compiled=False, 2025-05-07T20:32:11.6990030Z ) 2025-05-07T20:32:11.6990264Z self = 2025-05-07T20:32:11.6990447Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6990535Z 2025-05-07T20:32:11.6990625Z @given( 2025-05-07T20:32:11.6990748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6990854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6990981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6991101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6991218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6991303Z ) 2025-05-07T20:32:11.6991557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6991668Z def test_silu_mul_quant( 2025-05-07T20:32:11.6991748Z self, 2025-05-07T20:32:11.6991828Z T: int, 2025-05-07T20:32:11.6991915Z D: int, 2025-05-07T20:32:11.6992018Z scale_ub: Optional[float], 2025-05-07T20:32:11.6992113Z contiguous: bool, 2025-05-07T20:32:11.6992291Z compiled: bool, 2025-05-07T20:32:11.6992373Z ) -> None: 2025-05-07T20:32:11.6992476Z torch.manual_seed(2025) 2025-05-07T20:32:11.6992562Z 2025-05-07T20:32:11.6992739Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6994568Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6994574Z 2025-05-07T20:32:11.6994697Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.6994706Z 2025-05-07T20:32:11.6994813Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6995057Z self=, 2025-05-07T20:32:11.6995137Z T=2048, 2025-05-07T20:32:11.6995224Z D=5120, 2025-05-07T20:32:11.6995310Z scale_ub=None, 2025-05-07T20:32:11.6995400Z contiguous=False, 2025-05-07T20:32:11.6995495Z compiled=False, 2025-05-07T20:32:11.6995572Z ) 2025-05-07T20:32:11.6995797Z self = 2025-05-07T20:32:11.6995985Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.6995990Z 2025-05-07T20:32:11.6996069Z @given( 2025-05-07T20:32:11.6996192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6996303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6996421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6996550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6996673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6996753Z ) 2025-05-07T20:32:11.6997015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6997114Z def test_silu_mul_quant( 2025-05-07T20:32:11.6997194Z self, 2025-05-07T20:32:11.6997281Z T: int, 2025-05-07T20:32:11.6997360Z D: int, 2025-05-07T20:32:11.6997461Z scale_ub: Optional[float], 2025-05-07T20:32:11.6997560Z contiguous: bool, 2025-05-07T20:32:11.6997649Z compiled: bool, 2025-05-07T20:32:11.6997729Z ) -> None: 2025-05-07T20:32:11.6997836Z torch.manual_seed(2025) 2025-05-07T20:32:11.6997912Z 2025-05-07T20:32:11.6998094Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6999997Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7000009Z 2025-05-07T20:32:11.7000235Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7000240Z 2025-05-07T20:32:11.7000352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7000585Z self=, 2025-05-07T20:32:11.7000672Z T=4096, 2025-05-07T20:32:11.7000751Z D=7168, 2025-05-07T20:32:11.7000838Z scale_ub=None, 2025-05-07T20:32:11.7000933Z contiguous=True, 2025-05-07T20:32:11.7001018Z compiled=True, 2025-05-07T20:32:11.7001094Z ) 2025-05-07T20:32:11.7001328Z self = 2025-05-07T20:32:11.7001620Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.7001625Z 2025-05-07T20:32:11.7001710Z @given( 2025-05-07T20:32:11.7001834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7001937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7002061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7002181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7002299Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7002384Z ) 2025-05-07T20:32:11.7002638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7002745Z def test_silu_mul_quant( 2025-05-07T20:32:11.7002826Z self, 2025-05-07T20:32:11.7002905Z T: int, 2025-05-07T20:32:11.7002992Z D: int, 2025-05-07T20:32:11.7003096Z scale_ub: Optional[float], 2025-05-07T20:32:11.7003195Z contiguous: bool, 2025-05-07T20:32:11.7003294Z compiled: bool, 2025-05-07T20:32:11.7003375Z ) -> None: 2025-05-07T20:32:11.7003473Z torch.manual_seed(2025) 2025-05-07T20:32:11.7003555Z 2025-05-07T20:32:11.7003730Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7005563Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7005569Z 2025-05-07T20:32:11.7005695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7005700Z 2025-05-07T20:32:11.7005809Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7006047Z self=, 2025-05-07T20:32:11.7006129Z T=2048, 2025-05-07T20:32:11.7006214Z D=5120, 2025-05-07T20:32:11.7006302Z scale_ub=1200.0, 2025-05-07T20:32:11.7006390Z contiguous=False, 2025-05-07T20:32:11.7006487Z compiled=False, 2025-05-07T20:32:11.7006563Z ) 2025-05-07T20:32:11.7006790Z self = 2025-05-07T20:32:11.7006984Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:11.7006988Z 2025-05-07T20:32:11.7007071Z @given( 2025-05-07T20:32:11.7007193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7007304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7007423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7007557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7007762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7007841Z ) 2025-05-07T20:32:11.7008104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7008203Z def test_silu_mul_quant( 2025-05-07T20:32:11.7008284Z self, 2025-05-07T20:32:11.7008371Z T: int, 2025-05-07T20:32:11.7008451Z D: int, 2025-05-07T20:32:11.7008554Z scale_ub: Optional[float], 2025-05-07T20:32:11.7008655Z contiguous: bool, 2025-05-07T20:32:11.7008745Z compiled: bool, 2025-05-07T20:32:11.7008826Z ) -> None: 2025-05-07T20:32:11.7008933Z torch.manual_seed(2025) 2025-05-07T20:32:11.7009010Z 2025-05-07T20:32:11.7009192Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7011021Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
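[annotation] Note the pattern across these examples: after the first failure, every subsequent Hypothesis example also OOMs and free memory stays pinned near 30 MiB, suggesting tensors from earlier examples are still held by the caching allocator. Hypothetical cleanup lines for the top of the test body (tearDown would not help, since Hypothesis runs all examples inside one test call):

    import gc
    import torch

    # Release references and cached blocks left over from the previous example,
    # so a single OOM does not cascade through the remaining examples.
    gc.collect()
    torch.cuda.empty_cache()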
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7011107Z 2025-05-07T20:32:11.7011237Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7011242Z 2025-05-07T20:32:11.7011356Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7011588Z self=, 2025-05-07T20:32:11.7011675Z T=4096, 2025-05-07T20:32:11.7011755Z D=7168, 2025-05-07T20:32:11.7011846Z scale_ub=1200.0, 2025-05-07T20:32:11.7011933Z contiguous=True, 2025-05-07T20:32:11.7012020Z compiled=False, 2025-05-07T20:32:11.7012110Z ) 2025-05-07T20:32:11.7012342Z self = 2025-05-07T20:32:11.7012522Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7012527Z 2025-05-07T20:32:11.7012611Z @given( 2025-05-07T20:32:11.7012733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7012835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7012959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7013080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7013202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7013280Z ) 2025-05-07T20:32:11.7014114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7014229Z def test_silu_mul_quant( 2025-05-07T20:32:11.7014310Z self, 2025-05-07T20:32:11.7014390Z T: int, 2025-05-07T20:32:11.7014484Z D: int, 2025-05-07T20:32:11.7014587Z scale_ub: Optional[float], 2025-05-07T20:32:11.7014683Z contiguous: bool, 2025-05-07T20:32:11.7014779Z compiled: bool, 2025-05-07T20:32:11.7014861Z ) -> None: 2025-05-07T20:32:11.7014963Z torch.manual_seed(2025) 2025-05-07T20:32:11.7015048Z 2025-05-07T20:32:11.7015224Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7017063Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7017073Z 2025-05-07T20:32:11.7017443Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7017449Z 2025-05-07T20:32:11.7017567Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7017798Z self=, 2025-05-07T20:32:11.7017878Z T=16384, 2025-05-07T20:32:11.7017967Z D=7168, 2025-05-07T20:32:11.7018053Z scale_ub=None, 2025-05-07T20:32:11.7018147Z contiguous=False, 2025-05-07T20:32:11.7018240Z compiled=True, 2025-05-07T20:32:11.7018317Z ) 2025-05-07T20:32:11.7018541Z self = 2025-05-07T20:32:11.7018730Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.7018734Z 2025-05-07T20:32:11.7018812Z @given( 2025-05-07T20:32:11.7018938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7019039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7019280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7019410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7019526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7019604Z ) 2025-05-07T20:32:11.7019865Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7019962Z def test_silu_mul_quant( 2025-05-07T20:32:11.7020050Z self, 2025-05-07T20:32:11.7020130Z T: int, 2025-05-07T20:32:11.7020209Z D: int, 2025-05-07T20:32:11.7020315Z scale_ub: Optional[float], 2025-05-07T20:32:11.7020412Z contiguous: bool, 2025-05-07T20:32:11.7020502Z compiled: bool, 2025-05-07T20:32:11.7020588Z ) -> None: 2025-05-07T20:32:11.7020687Z torch.manual_seed(2025) 2025-05-07T20:32:11.7020763Z 2025-05-07T20:32:11.7020945Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7022784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
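[annotation] The requested sizes line up exactly with the input shape: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes. Checking the example above:

    # T=16384, D=7168: 16384 * (2 * 7168) * 2 bytes = 469,762,048 bytes = 448 MiB,
    # matching "Tried to allocate 448.00 MiB"; T=2048, D=5120 gives 40 MiB the same way.
    T, D = 16384, 7168
    print(T * (2 * D) * 2 / 2**20)  # -> 448.0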
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7022790Z 2025-05-07T20:32:11.7022917Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7022921Z 2025-05-07T20:32:11.7023027Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7023266Z self=, 2025-05-07T20:32:11.7023346Z T=4096, 2025-05-07T20:32:11.7023425Z D=7168, 2025-05-07T20:32:11.7023515Z scale_ub=None, 2025-05-07T20:32:11.7023609Z contiguous=True, 2025-05-07T20:32:11.7023697Z compiled=False, 2025-05-07T20:32:11.7023784Z ) 2025-05-07T20:32:11.7024011Z self = 2025-05-07T20:32:11.7024188Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7024192Z 2025-05-07T20:32:11.7024282Z @given( 2025-05-07T20:32:11.7024405Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7024508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7024632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7024751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7024872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7024950Z ) 2025-05-07T20:32:11.7025202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7025305Z def test_silu_mul_quant( 2025-05-07T20:32:11.7025390Z self, 2025-05-07T20:32:11.7025471Z T: int, 2025-05-07T20:32:11.7025645Z D: int, 2025-05-07T20:32:11.7025747Z scale_ub: Optional[float], 2025-05-07T20:32:11.7025839Z contiguous: bool, 2025-05-07T20:32:11.7025937Z compiled: bool, 2025-05-07T20:32:11.7026017Z ) -> None: 2025-05-07T20:32:11.7026115Z torch.manual_seed(2025) 2025-05-07T20:32:11.7026197Z 2025-05-07T20:32:11.7026372Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7028210Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7028289Z 2025-05-07T20:32:11.7028412Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7028417Z 2025-05-07T20:32:11.7028530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7028764Z self=, 2025-05-07T20:32:11.7028848Z T=16384, 2025-05-07T20:32:11.7028937Z D=7168, 2025-05-07T20:32:11.7029022Z scale_ub=None, 2025-05-07T20:32:11.7029111Z contiguous=True, 2025-05-07T20:32:11.7029205Z compiled=False, 2025-05-07T20:32:11.7029285Z ) 2025-05-07T20:32:11.7029512Z self = 2025-05-07T20:32:11.7029700Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7029705Z 2025-05-07T20:32:11.7029784Z @given( 2025-05-07T20:32:11.7029914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7030026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7030148Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7030277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7030395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7030476Z ) 2025-05-07T20:32:11.7030739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7030837Z def test_silu_mul_quant( 2025-05-07T20:32:11.7030924Z self, 2025-05-07T20:32:11.7031005Z T: int, 2025-05-07T20:32:11.7031086Z D: int, 2025-05-07T20:32:11.7031202Z scale_ub: Optional[float], 2025-05-07T20:32:11.7031294Z contiguous: bool, 2025-05-07T20:32:11.7031384Z compiled: bool, 2025-05-07T20:32:11.7031473Z ) -> None: 2025-05-07T20:32:11.7031571Z torch.manual_seed(2025) 2025-05-07T20:32:11.7031647Z 2025-05-07T20:32:11.7031829Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7033668Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7033675Z 2025-05-07T20:32:11.7033803Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7033807Z 2025-05-07T20:32:11.7033917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7034153Z self=, 2025-05-07T20:32:11.7034234Z T=16384, 2025-05-07T20:32:11.7034318Z D=7168, 2025-05-07T20:32:11.7034410Z scale_ub=1200.0, 2025-05-07T20:32:11.7034607Z contiguous=True, 2025-05-07T20:32:11.7034695Z compiled=False, 2025-05-07T20:32:11.7034780Z ) 2025-05-07T20:32:11.7035005Z self = 2025-05-07T20:32:11.7035188Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7035192Z 2025-05-07T20:32:11.7035280Z @given( 2025-05-07T20:32:11.7035405Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7035508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7035633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7035753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7035876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7035954Z ) 2025-05-07T20:32:11.7036207Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7036491Z def test_silu_mul_quant( 2025-05-07T20:32:11.7036576Z self, 2025-05-07T20:32:11.7036656Z T: int, 2025-05-07T20:32:11.7036748Z D: int, 2025-05-07T20:32:11.7036849Z scale_ub: Optional[float], 2025-05-07T20:32:11.7036941Z contiguous: bool, 2025-05-07T20:32:11.7037036Z compiled: bool, 2025-05-07T20:32:11.7037117Z ) -> None: 2025-05-07T20:32:11.7037214Z torch.manual_seed(2025) 2025-05-07T20:32:11.7037297Z 2025-05-07T20:32:11.7037472Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7039305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
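[annotation] Rather than failing example by example, the fp8 path could be skipped up front on GPUs below SM 8.9. A hedged sketch, assuming the suite is unittest-based as the tracebacks suggest (class name here is hypothetical):

    import unittest
    import torch

    @unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9),
        "fp8e4nv requires SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
    )
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical name
        ...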
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7039316Z 2025-05-07T20:32:11.7039437Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7039442Z 2025-05-07T20:32:11.7039553Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7039785Z self=, 2025-05-07T20:32:11.7039865Z T=128, 2025-05-07T20:32:11.7039952Z D=5120, 2025-05-07T20:32:11.7040038Z scale_ub=1200.0, 2025-05-07T20:32:11.7040214Z contiguous=False, 2025-05-07T20:32:11.7040307Z compiled=False, 2025-05-07T20:32:11.7040381Z ) 2025-05-07T20:32:11.7040605Z self = 2025-05-07T20:32:11.7040786Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:11.7040791Z 2025-05-07T20:32:11.7040877Z @given( 2025-05-07T20:32:11.7041007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7041106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7041223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7041347Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7041463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7041544Z ) 2025-05-07T20:32:11.7041802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7041898Z def test_silu_mul_quant( 2025-05-07T20:32:11.7041982Z self, 2025-05-07T20:32:11.7042060Z T: int, 2025-05-07T20:32:11.7042136Z D: int, 2025-05-07T20:32:11.7042241Z scale_ub: Optional[float], 2025-05-07T20:32:11.7042330Z contiguous: bool, 2025-05-07T20:32:11.7042416Z compiled: bool, 2025-05-07T20:32:11.7042501Z ) -> None: 2025-05-07T20:32:11.7042598Z torch.manual_seed(2025) 2025-05-07T20:32:11.7042677Z 2025-05-07T20:32:11.7042945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7043021Z 2025-05-07T20:32:11.7043115Z x_sign = torch.sign(x) 2025-05-07T20:32:11.7043251Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.7043343Z x = x_sign * x_clamp 2025-05-07T20:32:11.7043425Z x0 = x[:, :D] 2025-05-07T20:32:11.7043513Z x1 = x[:, D:] 2025-05-07T20:32:11.7043586Z 2025-05-07T20:32:11.7043683Z if contiguous: 2025-05-07T20:32:11.7043776Z x0 = x0.contiguous() 2025-05-07T20:32:11.7043867Z x1 = x1.contiguous() 2025-05-07T20:32:11.7043949Z 2025-05-07T20:32:11.7044043Z if scale_ub is not None: 2025-05-07T20:32:11.7044150Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.7044297Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.7044374Z ) 2025-05-07T20:32:11.7044535Z else: 2025-05-07T20:32:11.7044640Z scale_ub_tensor = None 2025-05-07T20:32:11.7044721Z 2025-05-07T20:32:11.7044853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.7044954Z op = silu_mul_quant 2025-05-07T20:32:11.7045041Z if compiled: 2025-05-07T20:32:11.7045148Z op = torch.compile(op) 2025-05-07T20:32:11.7045258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7045332Z 2025-05-07T20:32:11.7045430Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.7045434Z 2025-05-07T20:32:11.7045534Z moe/activation_test.py:117: 2025-05-07T20:32:11.7045667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7045777Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.7045881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7046396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.7046507Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.7046886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.7047125Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.7047477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.7047576Z kernel = self.compile( 2025-05-07T20:32:11.7047978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.7048158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.7048295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7048300Z 2025-05-07T20:32:11.7048510Z self = 2025-05-07T20:32:11.7049320Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.7049850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef0f39bc0>} 2025-05-07T20:32:11.7050619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.7050821Z context = 2025-05-07T20:32:11.7050825Z 2025-05-07T20:32:11.7050999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.7051270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.7051474Z module_map=module_map) 2025-05-07T20:32:11.7051643Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.7051753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.7051835Z E ^ 2025-05-07T20:32:11.7052201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.7052206Z 2025-05-07T20:32:11.7052639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.7052643Z 2025-05-07T20:32:11.7052750Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7052984Z self=, 2025-05-07T20:32:11.7053064Z T=2048, 2025-05-07T20:32:11.7053143Z D=7168, 2025-05-07T20:32:11.7053233Z scale_ub=None, 2025-05-07T20:32:11.7053323Z contiguous=False, 2025-05-07T20:32:11.7053492Z compiled=False, 2025-05-07T20:32:11.7053574Z ) 2025-05-07T20:32:11.7053803Z self = 2025-05-07T20:32:11.7053985Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.7053989Z 2025-05-07T20:32:11.7054072Z @given( 2025-05-07T20:32:11.7054193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7054302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7054419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7054538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7054660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7054736Z ) 2025-05-07T20:32:11.7054990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7055092Z def test_silu_mul_quant( 2025-05-07T20:32:11.7055171Z self, 2025-05-07T20:32:11.7055256Z T: int, 2025-05-07T20:32:11.7055340Z D: int, 2025-05-07T20:32:11.7055447Z scale_ub: Optional[float], 2025-05-07T20:32:11.7055538Z contiguous: bool, 2025-05-07T20:32:11.7055631Z compiled: bool, 2025-05-07T20:32:11.7055711Z ) -> None: 2025-05-07T20:32:11.7055814Z torch.manual_seed(2025) 2025-05-07T20:32:11.7055890Z 2025-05-07T20:32:11.7056064Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7057910Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7057921Z 2025-05-07T20:32:11.7058045Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7058050Z 2025-05-07T20:32:11.7058162Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7058394Z self=, 2025-05-07T20:32:11.7058473Z T=128, 2025-05-07T20:32:11.7058558Z D=7168, 2025-05-07T20:32:11.7058657Z scale_ub=1200.0, 2025-05-07T20:32:11.7058755Z contiguous=True, 2025-05-07T20:32:11.7058862Z compiled=True, 2025-05-07T20:32:11.7058948Z ) 2025-05-07T20:32:11.7059179Z self = 2025-05-07T20:32:11.7059351Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.7059355Z 2025-05-07T20:32:11.7059434Z @given( 2025-05-07T20:32:11.7059560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7059669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7059866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7059994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7060112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7060190Z ) 2025-05-07T20:32:11.7060450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7060547Z def test_silu_mul_quant( 2025-05-07T20:32:11.7060635Z self, 2025-05-07T20:32:11.7060713Z T: int, 2025-05-07T20:32:11.7060793Z D: int, 2025-05-07T20:32:11.7060899Z scale_ub: Optional[float], 2025-05-07T20:32:11.7060990Z contiguous: bool, 2025-05-07T20:32:11.7061080Z compiled: bool, 2025-05-07T20:32:11.7061166Z ) -> None: 2025-05-07T20:32:11.7061266Z torch.manual_seed(2025) 2025-05-07T20:32:11.7061343Z 2025-05-07T20:32:11.7061520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7061672Z 2025-05-07T20:32:11.7061773Z x_sign = torch.sign(x) 2025-05-07T20:32:11.7061907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.7061998Z x = x_sign * x_clamp 2025-05-07T20:32:11.7062092Z x0 = x[:, :D] 2025-05-07T20:32:11.7062177Z x1 = x[:, D:] 2025-05-07T20:32:11.7062253Z 2025-05-07T20:32:11.7062347Z if contiguous: 2025-05-07T20:32:11.7062441Z x0 = x0.contiguous() 2025-05-07T20:32:11.7062536Z x1 = x1.contiguous() 2025-05-07T20:32:11.7062620Z 2025-05-07T20:32:11.7062712Z if scale_ub is not None: 2025-05-07T20:32:11.7062821Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.7062974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.7063054Z ) 2025-05-07T20:32:11.7063134Z else: 2025-05-07T20:32:11.7063237Z scale_ub_tensor = None 2025-05-07T20:32:11.7063314Z 2025-05-07T20:32:11.7063454Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.7063559Z op = silu_mul_quant 2025-05-07T20:32:11.7063646Z if compiled: 2025-05-07T20:32:11.7063753Z op = torch.compile(op) 2025-05-07T20:32:11.7063862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7063938Z 2025-05-07T20:32:11.7064036Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.7064040Z 2025-05-07T20:32:11.7064141Z moe/activation_test.py:117: 2025-05-07T20:32:11.7064271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7064380Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.7064484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7064864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.7064966Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.7065474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.7065591Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.7065960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.7066190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.7066550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.7066647Z kernel = self.compile( 2025-05-07T20:32:11.7067050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.7067232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.7067369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7067373Z 2025-05-07T20:32:11.7067584Z self = 2025-05-07T20:32:11.7068501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.7069085Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef0e2c2c0>} 2025-05-07T20:32:11.7069854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.7070062Z context = 2025-05-07T20:32:11.7070067Z 2025-05-07T20:32:11.7070240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.7070598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.7070712Z module_map=module_map) 2025-05-07T20:32:11.7070879Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.7070994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.7071074Z E ^ 2025-05-07T20:32:11.7071440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.7071444Z 2025-05-07T20:32:11.7071882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.7071887Z 2025-05-07T20:32:11.7071994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7072228Z self=, 2025-05-07T20:32:11.7072309Z T=128, 2025-05-07T20:32:11.7072388Z D=7168, 2025-05-07T20:32:11.7072487Z scale_ub=1200.0, 2025-05-07T20:32:11.7072576Z contiguous=True, 2025-05-07T20:32:11.7072665Z compiled=False, 2025-05-07T20:32:11.7072746Z ) 2025-05-07T20:32:11.7072971Z self = 2025-05-07T20:32:11.7073149Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7073160Z 2025-05-07T20:32:11.7073238Z @given( 2025-05-07T20:32:11.7073360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7073467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7073583Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7073703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7073824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7073901Z ) 2025-05-07T20:32:11.7074157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7074259Z def test_silu_mul_quant( 2025-05-07T20:32:11.7074343Z self, 2025-05-07T20:32:11.7074422Z T: int, 2025-05-07T20:32:11.7074514Z D: int, 2025-05-07T20:32:11.7074614Z scale_ub: Optional[float], 2025-05-07T20:32:11.7074712Z contiguous: bool, 2025-05-07T20:32:11.7074800Z compiled: bool, 2025-05-07T20:32:11.7074880Z ) -> None: 2025-05-07T20:32:11.7074984Z torch.manual_seed(2025) 2025-05-07T20:32:11.7075060Z 2025-05-07T20:32:11.7075232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7075315Z 2025-05-07T20:32:11.7075409Z x_sign = torch.sign(x) 2025-05-07T20:32:11.7075537Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.7077467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7077478Z 2025-05-07T20:32:11.7077601Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:11.7077606Z 2025-05-07T20:32:11.7077718Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7077949Z self=, 2025-05-07T20:32:11.7078034Z T=128, 2025-05-07T20:32:11.7078114Z D=5120, 2025-05-07T20:32:11.7078198Z scale_ub=1200.0, 2025-05-07T20:32:11.7078290Z contiguous=True, 2025-05-07T20:32:11.7078377Z compiled=True, 2025-05-07T20:32:11.7078452Z ) 2025-05-07T20:32:11.7078681Z self = 2025-05-07T20:32:11.7078855Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.7078937Z 2025-05-07T20:32:11.7079021Z @given( 2025-05-07T20:32:11.7079149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7079266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7079383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7079503Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7079628Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7079705Z ) 2025-05-07T20:32:11.7079957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7080063Z def test_silu_mul_quant( 2025-05-07T20:32:11.7080196Z self, 2025-05-07T20:32:11.7080275Z T: int, 2025-05-07T20:32:11.7080364Z D: int, 2025-05-07T20:32:11.7080464Z scale_ub: Optional[float], 2025-05-07T20:32:11.7080561Z contiguous: bool, 2025-05-07T20:32:11.7080650Z compiled: bool, 2025-05-07T20:32:11.7080738Z ) -> None: 2025-05-07T20:32:11.7080847Z torch.manual_seed(2025) 2025-05-07T20:32:11.7080927Z 2025-05-07T20:32:11.7081100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7081183Z 2025-05-07T20:32:11.7081278Z > x_sign = torch.sign(x) 2025-05-07T20:32:11.7083103Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7083109Z 2025-05-07T20:32:11.7083229Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.7083238Z 2025-05-07T20:32:11.7083349Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7083585Z self=, 2025-05-07T20:32:11.7083666Z T=128, 2025-05-07T20:32:11.7083751Z D=7168, 2025-05-07T20:32:11.7083835Z scale_ub=None, 2025-05-07T20:32:11.7083923Z contiguous=True, 2025-05-07T20:32:11.7084014Z compiled=True, 2025-05-07T20:32:11.7084091Z ) 2025-05-07T20:32:11.7084317Z self = 2025-05-07T20:32:11.7084493Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.7084497Z 2025-05-07T20:32:11.7084574Z @given( 2025-05-07T20:32:11.7084695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7084801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7084920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7085051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7085250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7085328Z ) 2025-05-07T20:32:11.7085589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7085687Z def test_silu_mul_quant( 2025-05-07T20:32:11.7085852Z self, 2025-05-07T20:32:11.7085969Z T: int, 2025-05-07T20:32:11.7093994Z D: int, 2025-05-07T20:32:11.7094130Z scale_ub: Optional[float], 2025-05-07T20:32:11.7094227Z contiguous: bool, 2025-05-07T20:32:11.7094318Z compiled: bool, 2025-05-07T20:32:11.7094407Z ) -> None: 2025-05-07T20:32:11.7094507Z torch.manual_seed(2025) 2025-05-07T20:32:11.7094584Z 2025-05-07T20:32:11.7094780Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7096638Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7096797Z 2025-05-07T20:32:11.7096934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7097079Z =============================== warnings summary =============================== 2025-05-07T20:32:11.7097408Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:11.7097732Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:11.7098045Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:11.7098973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:11.7099216Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:11.7099221Z 2025-05-07T20:32:11.7099415Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:11.7100728Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:11.7100929Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:11.7100938Z 2025-05-07T20:32:11.7101166Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:11.7101337Z ================== 1 failed, 1 passed, 13 warnings in 18.83s =================== 2025-05-07T20:32:13.5047905Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:13.5673379Z 2025-05-07T20:32:13.5673852Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:13.5674372Z 2025-05-07T20:32:13.5674378Z 2025-05-07T20:32:13.5696193Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:15.7274146Z ============================= test session starts ============================== 2025-05-07T20:32:15.7275655Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:15.7276641Z cachedir: .pytest_cache 2025-05-07T20:32:15.7277621Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:15.7278926Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:15.7279688Z plugins: hypothesis-6.131.14 2025-05-07T20:32:17.2756923Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:17.3721923Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:17.3722345Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:17.3722574Z 2025-05-07T20:32:19.2153885Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:19.2156743Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:19.2159599Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:19.2161778Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:19.2162831Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.2164219Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:19.2165705Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.2167105Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:19.2168575Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.2169695Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:19.2171047Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:19.2172377Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:19.2173279Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:19.2174565Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:19.2175860Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:19.2177133Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:19.2178228Z W0507 
20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:19.2179532Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:19.2180905Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:19.2181879Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:19.2183126Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:19.2184241Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:19.2185066Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:19.2186327Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:19.2187767Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:19.2188914Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.2189896Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.2190702Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:19.2191853Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6215512Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6216274Z self=, 2025-05-07T20:32:19.6216706Z T=1, 2025-05-07T20:32:19.6216921Z D=5120, 2025-05-07T20:32:19.6217122Z scale_ub=None, 2025-05-07T20:32:19.6217350Z contiguous=True, 2025-05-07T20:32:19.6217587Z compiled=True, 2025-05-07T20:32:19.6217803Z ) 2025-05-07T20:32:19.6218149Z self = 2025-05-07T20:32:19.6219072Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:19.6219350Z 2025-05-07T20:32:19.6219444Z @given( 2025-05-07T20:32:19.6219690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6220031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6220358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6220700Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6221049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6221354Z ) 2025-05-07T20:32:19.6221720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6222190Z def test_silu_mul_quant( 2025-05-07T20:32:19.6222450Z self, 2025-05-07T20:32:19.6222654Z T: int, 2025-05-07T20:32:19.6222867Z D: int, 2025-05-07T20:32:19.6223101Z scale_ub: Optional[float], 2025-05-07T20:32:19.6223390Z contiguous: bool, 2025-05-07T20:32:19.6223650Z compiled: bool, 2025-05-07T20:32:19.6223902Z ) -> None: 2025-05-07T20:32:19.6224125Z torch.manual_seed(2025) 2025-05-07T20:32:19.6224383Z 2025-05-07T20:32:19.6224673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6225037Z 2025-05-07T20:32:19.6225238Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6225548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6225878Z x = x_sign * x_clamp 2025-05-07T20:32:19.6226127Z x0 = x[:, :D] 2025-05-07T20:32:19.6226356Z x1 = x[:, D:] 2025-05-07T20:32:19.6226573Z 2025-05-07T20:32:19.6226763Z if contiguous: 2025-05-07T20:32:19.6227008Z x0 = x0.contiguous() 2025-05-07T20:32:19.6227280Z x1 = x1.contiguous() 2025-05-07T20:32:19.6227530Z 2025-05-07T20:32:19.6227737Z if scale_ub is not None: 2025-05-07T20:32:19.6228038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6228391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6228718Z ) 2025-05-07T20:32:19.6228924Z else: 2025-05-07T20:32:19.6229140Z scale_ub_tensor = None 2025-05-07T20:32:19.6229407Z 2025-05-07T20:32:19.6229651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6229976Z op = silu_mul_quant 2025-05-07T20:32:19.6230243Z if compiled: 2025-05-07T20:32:19.6230504Z op = torch.compile(op) 2025-05-07T20:32:19.6230828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6231155Z 2025-05-07T20:32:19.6231358Z y_fp8, y_scale = fn() 2025-05-07T20:32:19.6231657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.6231958Z 2025-05-07T20:32:19.6232210Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6232562Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.6232869Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.6233362Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.6233743Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.6234063Z 2025-05-07T20:32:19.6234274Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.6234484Z 2025-05-07T20:32:19.6234591Z moe/activation_test.py:126: 2025-05-07T20:32:19.6234902Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6235250Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.6235595Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.6236424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.6237200Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.6237776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6238584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6239305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.6240056Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.6240948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.6241619Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.6242253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.6242791Z fn() 2025-05-07T20:32:19.6243322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.6243937Z self.fn.run( 2025-05-07T20:32:19.6244427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6244982Z kernel = self.compile( 2025-05-07T20:32:19.6245549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6246234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6246650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6246902Z 2025-05-07T20:32:19.6247117Z self = 2025-05-07T20:32:19.6248242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6249697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3294bd36a0>} 2025-05-07T20:32:19.6251085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6252144Z context = 2025-05-07T20:32:19.6252451Z 2025-05-07T20:32:19.6252625Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6253171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6253853Z module_map=module_map) 2025-05-07T20:32:19.6254241Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6254618Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.6254907Z E ^ 2025-05-07T20:32:19.6255488Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6255968Z 2025-05-07T20:32:19.6256406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.6256941Z 2025-05-07T20:32:19.6257062Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6257497Z self=, 2025-05-07T20:32:19.6257926Z T=2048, 2025-05-07T20:32:19.6258124Z D=5120, 2025-05-07T20:32:19.6258321Z scale_ub=1200.0, 2025-05-07T20:32:19.6258557Z contiguous=True, 2025-05-07T20:32:19.6258792Z compiled=False, 2025-05-07T20:32:19.6259012Z ) 2025-05-07T20:32:19.6259344Z self = 2025-05-07T20:32:19.6259865Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.6260238Z 2025-05-07T20:32:19.6260325Z @given( 2025-05-07T20:32:19.6260567Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6260932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6261284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6261628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6261977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6262281Z ) 2025-05-07T20:32:19.6262646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6263109Z def test_silu_mul_quant( 2025-05-07T20:32:19.6263366Z self, 2025-05-07T20:32:19.6263574Z T: int, 2025-05-07T20:32:19.6263778Z D: int, 2025-05-07T20:32:19.6264013Z scale_ub: Optional[float], 2025-05-07T20:32:19.6264299Z contiguous: bool, 2025-05-07T20:32:19.6264560Z compiled: bool, 2025-05-07T20:32:19.6264887Z ) -> None: 2025-05-07T20:32:19.6265153Z torch.manual_seed(2025) 2025-05-07T20:32:19.6265408Z 2025-05-07T20:32:19.6265695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6266054Z 2025-05-07T20:32:19.6266252Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6266558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6266883Z x = x_sign * x_clamp 2025-05-07T20:32:19.6267131Z x0 = x[:, :D] 2025-05-07T20:32:19.6267356Z x1 = x[:, D:] 2025-05-07T20:32:19.6267573Z 2025-05-07T20:32:19.6267764Z if contiguous: 2025-05-07T20:32:19.6268007Z x0 = x0.contiguous() 2025-05-07T20:32:19.6268278Z x1 = x1.contiguous() 2025-05-07T20:32:19.6268530Z 2025-05-07T20:32:19.6268726Z if scale_ub is not None: 2025-05-07T20:32:19.6269013Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6269369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6269699Z ) 2025-05-07T20:32:19.6269905Z else: 2025-05-07T20:32:19.6270134Z scale_ub_tensor = None 2025-05-07T20:32:19.6270391Z 2025-05-07T20:32:19.6270636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6270968Z op = silu_mul_quant 2025-05-07T20:32:19.6271225Z if compiled: 2025-05-07T20:32:19.6271487Z op = torch.compile(op) 2025-05-07T20:32:19.6271803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6272086Z 2025-05-07T20:32:19.6272291Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.6272463Z 2025-05-07T20:32:19.6272573Z moe/activation_test.py:117: 2025-05-07T20:32:19.6272881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6273227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.6273525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6274245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.6275056Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.6275626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6276339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6277032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6277582Z kernel = self.compile( 2025-05-07T20:32:19.6278147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6278832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6279242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6279488Z 2025-05-07T20:32:19.6279705Z self = 2025-05-07T20:32:19.6281050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6282475Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3294845f80>} 2025-05-07T20:32:19.6283870Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6284929Z context = 2025-05-07T20:32:19.6285234Z 2025-05-07T20:32:19.6285411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6285970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6286464Z module_map=module_map) 2025-05-07T20:32:19.6286843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6287217Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.6287492Z E ^ 2025-05-07T20:32:19.6287977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6288454Z 2025-05-07T20:32:19.6288888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.0199168Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.0200788Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:20.0202287Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.0203804Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.0204833Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.0206212Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.0207996Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.0209386Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.0210838Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.0211985Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] module_map=module_map) 2025-05-07T20:32:20.0213650Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.0215147Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:20.0216034Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.0217291Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.0218550Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:20.0219637Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.0220717Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:20.0221995Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:20.0223327Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:20.0224274Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.0225418Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:20.0226516Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:20.0227324Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:20.0228556Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.0229970Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.0231114Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.0232225Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.0233010Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:20.0234088Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.6717777Z 2025-05-07T20:32:20.6718508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.6719295Z self=, 2025-05-07T20:32:20.6719901Z T=2048, 2025-05-07T20:32:20.6720206Z D=5120, 2025-05-07T20:32:20.6720409Z scale_ub=1200.0, 2025-05-07T20:32:20.6720638Z contiguous=True, 2025-05-07T20:32:20.6720869Z compiled=True, 2025-05-07T20:32:20.6721088Z ) 2025-05-07T20:32:20.6721456Z self = 2025-05-07T20:32:20.6721994Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:20.6731024Z 2025-05-07T20:32:20.6731151Z @given( 2025-05-07T20:32:20.6731418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.6731756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.6732075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.6732426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.6732780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.6733085Z ) 2025-05-07T20:32:20.6733449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.6733916Z def test_silu_mul_quant( 2025-05-07T20:32:20.6734175Z self, 2025-05-07T20:32:20.6734380Z T: int, 2025-05-07T20:32:20.6734596Z D: int, 2025-05-07T20:32:20.6734827Z scale_ub: Optional[float], 2025-05-07T20:32:20.6735110Z contiguous: bool, 2025-05-07T20:32:20.6735365Z compiled: bool, 2025-05-07T20:32:20.6735607Z ) -> None: 2025-05-07T20:32:20.6735833Z torch.manual_seed(2025) 2025-05-07T20:32:20.6736098Z 2025-05-07T20:32:20.6736719Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.6737084Z 2025-05-07T20:32:20.6737293Z x_sign = torch.sign(x) 2025-05-07T20:32:20.6737601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.6737931Z x = x_sign * x_clamp 2025-05-07T20:32:20.6738180Z x0 = x[:, :D] 2025-05-07T20:32:20.6738412Z x1 = x[:, D:] 2025-05-07T20:32:20.6738636Z 2025-05-07T20:32:20.6738827Z if contiguous: 2025-05-07T20:32:20.6739073Z x0 = x0.contiguous() 2025-05-07T20:32:20.6739349Z x1 = x1.contiguous() 2025-05-07T20:32:20.6739597Z 2025-05-07T20:32:20.6739808Z if scale_ub is not None: 2025-05-07T20:32:20.6740102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.6740453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.6740787Z ) 2025-05-07T20:32:20.6741147Z else: 2025-05-07T20:32:20.6741366Z scale_ub_tensor = None 2025-05-07T20:32:20.6741641Z 2025-05-07T20:32:20.6741890Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.6742216Z op = silu_mul_quant 2025-05-07T20:32:20.6742482Z if compiled: 2025-05-07T20:32:20.6742748Z op = torch.compile(op) 2025-05-07T20:32:20.6743063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.6743348Z 2025-05-07T20:32:20.6743553Z y_fp8, y_scale = fn() 2025-05-07T20:32:20.6743858Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:20.6744159Z 2025-05-07T20:32:20.6744413Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.6744770Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:20.6745075Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:20.6745409Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:20.6745801Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.6746127Z 2025-05-07T20:32:20.6746346Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:20.6746555Z 2025-05-07T20:32:20.6746674Z moe/activation_test.py:126: 2025-05-07T20:32:20.6746991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6747344Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:20.6747694Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.6748531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:20.6749535Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:20.6750261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.6751093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.6751916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:20.6752666Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:20.6753435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:20.6754109Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:20.6754746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:20.6755283Z fn() 2025-05-07T20:32:20.6755814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:20.6756423Z self.fn.run( 2025-05-07T20:32:20.6756906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.6757466Z kernel = self.compile( 2025-05-07T20:32:20.6758132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.6758819Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.6759232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6759478Z 2025-05-07T20:32:20.6759695Z self = 2025-05-07T20:32:20.6760929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.6762440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32944c9d00>} 2025-05-07T20:32:20.6763939Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.6765016Z context = 2025-05-07T20:32:20.6765330Z 2025-05-07T20:32:20.6765504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.6766058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.6766548Z module_map=module_map) 2025-05-07T20:32:20.6766937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.6767315Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:20.6767590Z E ^ 2025-05-07T20:32:20.6768081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.6768570Z 2025-05-07T20:32:20.6769010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.6769549Z 2025-05-07T20:32:20.6769667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.6770099Z self=, 2025-05-07T20:32:20.6770525Z T=16384, 2025-05-07T20:32:20.6770730Z D=7168, 2025-05-07T20:32:20.6770929Z scale_ub=1200.0, 2025-05-07T20:32:20.6771168Z contiguous=False, 2025-05-07T20:32:20.6771410Z compiled=False, 2025-05-07T20:32:20.6771620Z ) 2025-05-07T20:32:20.6771958Z self = 2025-05-07T20:32:20.6772496Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:20.6772795Z 2025-05-07T20:32:20.6772885Z @given( 2025-05-07T20:32:20.6773124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.6773459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.6773796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.6774144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.6774498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.6774803Z ) 2025-05-07T20:32:20.6775167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.6775635Z def test_silu_mul_quant( 2025-05-07T20:32:20.6775895Z self, 2025-05-07T20:32:20.6776103Z T: int, 2025-05-07T20:32:20.6776307Z D: int, 2025-05-07T20:32:20.6776542Z scale_ub: Optional[float], 2025-05-07T20:32:20.6776839Z contiguous: bool, 2025-05-07T20:32:20.6777087Z compiled: bool, 2025-05-07T20:32:20.6777325Z ) -> None: 2025-05-07T20:32:20.6777557Z torch.manual_seed(2025) 2025-05-07T20:32:20.6777812Z 2025-05-07T20:32:20.6778108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.6778482Z 2025-05-07T20:32:20.6778681Z x_sign = torch.sign(x) 2025-05-07T20:32:20.6779125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.6779458Z x = x_sign * x_clamp 2025-05-07T20:32:20.6779707Z x0 = x[:, :D] 2025-05-07T20:32:20.6779938Z x1 = x[:, D:] 2025-05-07T20:32:20.6780160Z 2025-05-07T20:32:20.6780355Z if contiguous: 2025-05-07T20:32:20.6780603Z x0 = x0.contiguous() 2025-05-07T20:32:20.6780878Z x1 = x1.contiguous() 2025-05-07T20:32:20.6781136Z 2025-05-07T20:32:20.6781366Z if scale_ub is not None: 2025-05-07T20:32:20.6781668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.6782029Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.6782349Z ) 2025-05-07T20:32:20.6782553Z else: 2025-05-07T20:32:20.6782778Z scale_ub_tensor = None 2025-05-07T20:32:20.6783036Z 2025-05-07T20:32:20.6783365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.6783705Z op = silu_mul_quant 2025-05-07T20:32:20.6783964Z if compiled: 2025-05-07T20:32:20.6784225Z op = torch.compile(op) 2025-05-07T20:32:20.6784538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.6784823Z 2025-05-07T20:32:20.6785033Z > y_fp8, y_scale = fn() 2025-05-07T20:32:20.6785208Z 2025-05-07T20:32:20.6785318Z moe/activation_test.py:117: 2025-05-07T20:32:20.6785632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6785978Z moe/activation_test.py:115: in fn 2025-05-07T20:32:20.6786281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.6787014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
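Every CompilationError in this rerun bottoms out in the same ValueError: Triton's fp8e4nv type (FP8 E4M3) is unavailable on this GPU, so _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row both fail before the kernel ever launches. A hedged sketch of a capability guard such tests could use, assuming E4M3 requires compute capability 8.9 or newer (supports_fp8e4nv is an illustrative helper, not an FBGEMM API):

    import torch

    def supports_fp8e4nv() -> bool:
        # Gate on compute capability: Triton rejects fp8e4nv (the dtype
        # behind torch.float8_e4m3fn) on older NVIDIA architectures.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage: @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 unsupported")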
2025-05-07T20:32:20.6787732Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:20.6788300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.6789037Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.6789741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.6790296Z kernel = self.compile( 2025-05-07T20:32:20.6790873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.6791613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.6792053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6792293Z 2025-05-07T20:32:20.6792512Z self = 2025-05-07T20:32:20.6793646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.6795091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32944a4e00>} 2025-05-07T20:32:20.6796490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.6797561Z context = 2025-05-07T20:32:20.6797864Z 2025-05-07T20:32:20.6798042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.6798582Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.6799060Z module_map=module_map) 2025-05-07T20:32:20.6799443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.6799909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.6800281Z E ^ 2025-05-07T20:32:20.6800776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.6801251Z 2025-05-07T20:32:20.6801697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.9046354Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.9047502Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:20.9048926Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.9050609Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.9051631Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.9052994Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.9054428Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.9055796Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.9057228Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.9058323Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] module_map=module_map) 2025-05-07T20:32:20.9059643Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.9060942Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:20.9061825Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9063080Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.9064339Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:20.9065422Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.9066486Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:20.9067866Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:20.9069198Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:20.9070137Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9071269Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:20.9072399Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:20.9073289Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:20.9074507Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.9075910Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.9077012Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.9077954Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.9078733Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:20.9079812Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.9595325Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.9597506Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:20.9600361Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.9602235Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.9603260Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.9604619Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.9606053Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.9607416Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.9608996Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.9610098Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] module_map=module_map) 2025-05-07T20:32:20.9611416Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.9612711Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:20.9613756Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9615136Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.9616397Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:20.9617479Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.9618546Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return 
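Every failure in this stretch of the log has the same root cause: Triton refuses to lower fp8e4nv (its name for the FP8 E4M3 format) because the runner's GPU does not implement it, and only fp8e4b15 and fp8e5 are available. In Triton, fp8e4nv generally requires CUDA compute capability (8, 9) (Ada) or newer, so the failure can be predicted by probing the device up front. A minimal sketch of such a probe, assuming only PyTorch with CUDA support; the helper name is illustrative, not an FBGEMM or Triton API:

import torch

# Illustrative helper (not part of FBGEMM or Triton): report whether the
# current CUDA device can compile Triton's fp8e4nv (FP8 E4M3) kernels.
# E4M3 support generally starts at compute capability (8, 9), i.e. Ada.
def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv supported:", supports_fp8e4nv())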
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328ecd3ec0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
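The reference path that raises here, triton_quantize_fp8_row, performs row-wise FP8 quantization: each row is scaled by its absolute maximum (optionally capped by scale_ub) divided by the FP8 format's largest finite value. A rough pure-PyTorch sketch under those assumptions; this is illustrative only, not FBGEMM's actual implementation (E4M3's max finite value is 448, and the exact scale_ub semantics here are a guess):

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = row_max / E4M3_MAX
    y_fp8 = (y.float() / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    # Return the quantized rows plus the per-row dequantization scale,
    # so that y is approximately y_fp8.float() * scale.
    return y_fp8, scale.squeeze(-1)

This mirrors how the test consumes the result (y_fp8.to(torch.float32) * y_scale[:, None]); it requires a PyTorch build that provides torch.float8_e4m3fn.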
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError, raised eagerly from fn() at moe/activation_test.py:117 -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant; source listing and traceback otherwise identical to the example above; elided]
2025-05-07T20:32:21.6998360Z W0507 20:32:21.696000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[identical CompilationError traceback for _fbgemm_silu_mul_quant elided]
2025-05-07T20:32:21.8846675Z W0507 20:32:21.880000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[identical traceback elided]
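These W-level messages also explain why the compiled=True examples get as far as ref_fn(): when torch.compile cannot trace a user-defined Triton kernel to TTIR (here because compilation itself fails), it logs this warning, conservatively assumes every input tensor is mutated, and keeps going; the hard error then surfaces only when a kernel is launched eagerly, in the reference path. A minimal sketch of that conservative fallback pattern, paraphrased and illustrative only (the real logic lives in torch/_higher_order_ops/triton_kernel_wrap.py):

from typing import Any, Callable, Dict, List

def mutated_tensor_names(
    analyze: Callable[[Dict[str, Any]], List[str]], kwargs: Dict[str, Any]
) -> List[str]:
    # Try the precise TTIR-based analysis; on any failure fall back to
    # the safe over-approximation: report every argument as mutated.
    try:
        return analyze(kwargs)
    except Exception:
        return list(kwargs.keys())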
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError from fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
[same CompilationError from ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError from fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[same CompilationError from fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[same CompilationError from ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row; elided]
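Since every Hypothesis example fails for the same hardware reason rather than a logic bug, the usual remedy is to gate the test on device capability. A hedged sketch of such a guard; the decorator placement, helper, and message are illustrative, and FBGEMM's own suite may gate fp8 coverage differently:

import unittest

import torch

def _sm_at_least(major: int, minor: int) -> bool:
    # True when a CUDA device is present and reports at least this
    # compute capability.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (major, minor)

@unittest.skipIf(
    not _sm_at_least(8, 9),
    "Triton fp8e4nv (FP8 E4M3) needs compute capability >= (8, 9)",
)
class Fp8ActivationTest(unittest.TestCase):
    ...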
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6712713Z 2025-05-07T20:32:22.6713146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.8955770Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:22.8956891Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:22.8958508Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:22.8959993Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:22.8961113Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:22.8962475Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:22.8963915Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.8965279Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:22.8966715Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.8967814Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] module_map=module_map) 2025-05-07T20:32:22.8969141Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:22.8970451Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:22.8971332Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.8972587Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:22.8973847Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:22.8974934Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:22.8976122Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:22.8977404Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:22.8978741Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:22.8979690Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.8980826Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:22.8981941Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:22.8982854Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:22.8984079Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:22.8985486Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:22.8986596Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.8987544Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.8988334Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:22.8989414Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
[editor's note: the identify_mutated_tensors warning above, with its full traceback for _fbgemm_silu_mul_quant, is then repeated verbatim three more times for torch.compile restart attempts tagged [0/5] (W0507 20:32:22.954000, 20:32:23.446000 and 20:32:23.507000); only the timestamps differ. Each repeat ends with the same error:]
W0507 20:32:23.507000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.7816366Z 2025-05-07T20:32:23.7816654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.7817120Z self=, 2025-05-07T20:32:23.7817652Z T=2048, 2025-05-07T20:32:23.7818148Z D=5120, 2025-05-07T20:32:23.7818359Z scale_ub=None, 2025-05-07T20:32:23.7818581Z contiguous=True, 2025-05-07T20:32:23.7818813Z compiled=True, 2025-05-07T20:32:23.7819029Z ) 2025-05-07T20:32:23.7819358Z self = 2025-05-07T20:32:23.7819871Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:23.7820154Z 2025-05-07T20:32:23.7820237Z @given( 2025-05-07T20:32:23.7820474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.7820797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.7821119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.7821462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.7821803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.7822100Z ) 2025-05-07T20:32:23.7822468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.7823046Z def test_silu_mul_quant( 2025-05-07T20:32:23.7823303Z self, 2025-05-07T20:32:23.7823505Z T: int, 2025-05-07T20:32:23.7823704Z D: int, 2025-05-07T20:32:23.7823933Z scale_ub: Optional[float], 2025-05-07T20:32:23.7824218Z contiguous: bool, 2025-05-07T20:32:23.7824464Z compiled: bool, 2025-05-07T20:32:23.7824702Z ) -> None: 2025-05-07T20:32:23.7824923Z torch.manual_seed(2025) 2025-05-07T20:32:23.7825173Z 2025-05-07T20:32:23.7825457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.7825814Z 2025-05-07T20:32:23.7826016Z x_sign = torch.sign(x) 2025-05-07T20:32:23.7826314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.7826637Z x = x_sign * x_clamp 2025-05-07T20:32:23.7826890Z x0 = x[:, :D] 2025-05-07T20:32:23.7827110Z x1 = x[:, D:] 2025-05-07T20:32:23.7827332Z 2025-05-07T20:32:23.7827527Z if contiguous: 2025-05-07T20:32:23.7827766Z x0 = x0.contiguous() 2025-05-07T20:32:23.7828035Z x1 = x1.contiguous() 2025-05-07T20:32:23.7828289Z 2025-05-07T20:32:23.7828488Z if scale_ub is not None: 2025-05-07T20:32:23.7828769Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.7829116Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.7829437Z ) 2025-05-07T20:32:23.7829633Z else: 2025-05-07T20:32:23.7829854Z scale_ub_tensor = None 2025-05-07T20:32:23.7830115Z 2025-05-07T20:32:23.7830350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.7830678Z op = silu_mul_quant 2025-05-07T20:32:23.7830939Z if compiled: 2025-05-07T20:32:23.7831192Z op = torch.compile(op) 2025-05-07T20:32:23.7831499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.7831790Z 2025-05-07T20:32:23.7831988Z y_fp8, y_scale = fn() 2025-05-07T20:32:23.7832291Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:23.7832597Z 2025-05-07T20:32:23.7832841Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.7833202Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:23.7833511Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:23.7833840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:23.7834219Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:23.7834540Z 2025-05-07T20:32:23.7840524Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:23.7840772Z 2025-05-07T20:32:23.7840894Z moe/activation_test.py:126: 2025-05-07T20:32:23.7841221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.7841582Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:23.7841935Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:23.7842934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:23.7843717Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:23.7844291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.7845011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.7845734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:23.7846489Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:23.7847259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:23.7847933Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:23.7848653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:23.7849192Z fn() 2025-05-07T20:32:23.7849728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:23.7850343Z self.fn.run( 2025-05-07T20:32:23.7850831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.7851388Z kernel = self.compile( 2025-05-07T20:32:23.7851963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.7852650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.7853066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.7853313Z 2025-05-07T20:32:23.7853533Z self = 2025-05-07T20:32:23.7854672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.7856103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e716700>} 2025-05-07T20:32:23.7857495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.7858564Z context = 2025-05-07T20:32:23.7858871Z 2025-05-07T20:32:23.7859048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.7859600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.7860101Z module_map=module_map) 2025-05-07T20:32:23.7860487Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.7860867Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:23.7861145Z E ^ 2025-05-07T20:32:23.7861638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[editor's note: this example fails at moe/activation_test.py:126 with the identical CompilationError; the re-printed test source and traceback, byte-for-byte the same as the T=2048 case above apart from T, are omitted.]
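[editor's note: the W0507 identify_mutated_tensors warnings are a side effect of the same root cause: torch.compile lowers the user-defined Triton kernel to TTIR to work out which arguments it mutates, hits the identical fp8e4nv error, and falls back to assuming every input is mutated. The primary failure is the test's own: both the op under test (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row inside triton_quantize_fp8_row) need an fp8e4nv cast, so the reference cannot serve as a fallback on this GPU. A rough eager sketch of what the row-wise quantization computes, assuming the usual max-scaling recipe; quantize_fp8_row_eager is hypothetical, not the fbgemm implementation.]

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, optionally capped at scale_ub, gives the
        # per-row dequantization scale; divide and cast to quantize.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale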
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.7902701Z 2025-05-07T20:32:23.7903134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.0200990Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.0202115Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:24.0203731Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.0205229Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.0206248Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.0207620Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.0209068Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0210579Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.0212028Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0213121Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:24.0214613Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.0215917Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:24.0216800Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0218053Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.0219303Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:24.0220383Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.0221443Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:24.0222719Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.0224040Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.0224983Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0226119Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.0227200Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:24.0228131Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.0229354Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.0230755Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.0231855Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0232856Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0233631Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:24.0234806Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.0814471Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.0815582Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:24.0816975Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.0818470Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.0819492Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.0820849Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.0822295Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0823661Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.0825104Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0826196Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:24.0827510Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.0828810Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:24.0829694Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0831098Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.0832367Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:24.0833444Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.0834509Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return 
visitor(node) 2025-05-07T20:32:24.0835787Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.0837238Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.0838179Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0839320Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.0840498Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:24.0841307Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.0842540Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.0843949Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.0845053Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0846008Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0846790Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:24.0847859Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.6245389Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.6246517Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:24.6247923Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.6249414Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.6250616Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.6251988Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.6253484Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.6254849Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.6256297Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.6257510Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:24.6258844Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.6260146Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:24.6261034Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6262302Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.6263575Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:24.6264660Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.6265731Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:24.6267012Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.6268351Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.6269311Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6270443Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.6271537Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:24.6272362Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.6273635Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.6275164Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.6276273Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.6277230Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.6278018Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:24.6279095Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.6861469Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.6863201Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:24.6864605Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.6866091Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.6867120Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.6868501Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.6869950Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.6871319Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.6872764Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.6873863Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:24.6875196Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.6876506Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:24.6877397Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6878660Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.6879924Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:24.6881261Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.6882356Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:24.6883673Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.6885017Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.6885973Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6887197Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.6888293Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:24.6889108Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.6890334Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.6891759Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.6892886Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.6893849Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.6894634Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:24.6895711Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:24.9899151Z 
2025-05-07T20:32:24.9899650Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:24.9900703Z     self=,
2025-05-07T20:32:24.9901643Z     T=4096,
2025-05-07T20:32:24.9902047Z     D=5120,
2025-05-07T20:32:24.9902343Z     scale_ub=None,
2025-05-07T20:32:24.9902609Z     contiguous=True,
2025-05-07T20:32:24.9902847Z     compiled=True,
2025-05-07T20:32:24.9903057Z )
2025-05-07T20:32:24.9903419Z self = 
2025-05-07T20:32:24.9903929Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:24.9904206Z 
2025-05-07T20:32:24.9904292Z     @given(
2025-05-07T20:32:24.9904526Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:24.9904853Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:24.9905173Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:24.9905516Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:24.9905863Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:24.9906163Z     )
2025-05-07T20:32:24.9906531Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:24.9906991Z     def test_silu_mul_quant(
2025-05-07T20:32:24.9907242Z         self,
2025-05-07T20:32:24.9907606Z         T: int,
2025-05-07T20:32:24.9907809Z         D: int,
2025-05-07T20:32:24.9908034Z         scale_ub: Optional[float],
2025-05-07T20:32:24.9908314Z         contiguous: bool,
2025-05-07T20:32:24.9908558Z         compiled: bool,
2025-05-07T20:32:24.9908793Z     ) -> None:
2025-05-07T20:32:24.9909016Z         torch.manual_seed(2025)
2025-05-07T20:32:24.9909262Z 
2025-05-07T20:32:24.9909543Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:24.9909895Z 
2025-05-07T20:32:24.9910091Z         x_sign = torch.sign(x)
2025-05-07T20:32:24.9910393Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:24.9910712Z         x = x_sign * x_clamp
2025-05-07T20:32:24.9910955Z         x0 = x[:, :D]
2025-05-07T20:32:24.9911179Z         x1 = x[:, D:]
2025-05-07T20:32:24.9911394Z 
2025-05-07T20:32:24.9911580Z         if contiguous:
2025-05-07T20:32:24.9911938Z             x0 = x0.contiguous()
2025-05-07T20:32:24.9912217Z             x1 = x1.contiguous()
2025-05-07T20:32:24.9912488Z 
2025-05-07T20:32:24.9912708Z         if scale_ub is not None:
2025-05-07T20:32:24.9912994Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:24.9913519Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:24.9913839Z             )
2025-05-07T20:32:24.9914037Z         else:
2025-05-07T20:32:24.9914255Z             scale_ub_tensor = None
2025-05-07T20:32:24.9914513Z 
2025-05-07T20:32:24.9914752Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:24.9915079Z             op = silu_mul_quant
2025-05-07T20:32:24.9915331Z             if compiled:
2025-05-07T20:32:24.9915594Z                 op = torch.compile(op)
2025-05-07T20:32:24.9915898Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:24.9916178Z 
2025-05-07T20:32:24.9916379Z         y_fp8, y_scale = fn()
2025-05-07T20:32:24.9916681Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:24.9916980Z 
2025-05-07T20:32:24.9917229Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:24.9917575Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:24.9917883Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:24.9918205Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:24.9918578Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.9918901Z 
2025-05-07T20:32:24.9919106Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:24.9919313Z 
2025-05-07T20:32:24.9919418Z moe/activation_test.py:126: 
2025-05-07T20:32:24.9919728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:24.9920075Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:24.9920518Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.9921341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:24.9922128Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:24.9922696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:24.9923418Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:24.9924137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:24.9924896Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:24.9925657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:24.9926328Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:24.9926961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:24.9927635Z     fn()
2025-05-07T20:32:24.9928166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:24.9928767Z     self.fn.run(
2025-05-07T20:32:24.9929256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:24.9929809Z     kernel = self.compile(
2025-05-07T20:32:24.9930369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:24.9931051Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:24.9931470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:24.9931710Z 
2025-05-07T20:32:24.9931934Z self = 
2025-05-07T20:32:24.9933058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:24.9934648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f96ae700>}
2025-05-07T20:32:24.9936042Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:24.9937104Z context = 
2025-05-07T20:32:24.9937406Z 
2025-05-07T20:32:24.9937588Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:24.9938130Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:24.9938625Z                            module_map=module_map)
2025-05-07T20:32:24.9939011Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:24.9939379Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:24.9939660Z E       ^
2025-05-07T20:32:24.9940141Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:24.9940609Z 
2025-05-07T20:32:24.9941046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
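For reference, the contract this test relies on is that triton_quantize_fp8_row returns a row-quantized tensor plus per-row scales such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode sketch of that contract (our approximation, not FBGEMM's kernel; the scale_ub clamping follows the test's usage):

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def rowwise_quantize_fp8(y, scale_ub=None):
        # One scale per row: map each row's max |value| onto the fp8 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

PyTorch can generally store and cast float8_e4m3fn even on older GPUs; what fails in this log is Triton compiling the fp8e4nv type for sm_86, not the storage itself.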
2025-05-07T20:32:24.9941578Z 
2025-05-07T20:32:24.9941689Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:24.9942124Z     self=,
2025-05-07T20:32:24.9942571Z     T=16384,
2025-05-07T20:32:24.9942794Z     D=5120,
2025-05-07T20:32:24.9942994Z     scale_ub=None,
2025-05-07T20:32:24.9943213Z     contiguous=True,
2025-05-07T20:32:24.9943437Z     compiled=True,
2025-05-07T20:32:24.9943653Z )
2025-05-07T20:32:24.9943988Z self = 
2025-05-07T20:32:24.9944509Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:24.9965497Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:24.9965704Z 
2025-05-07T20:32:24.9965812Z moe/activation_test.py:126: 
2025-05-07T20:32:24.9985225Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:24.9985600Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:24.9985885Z E       ^
2025-05-07T20:32:24.9986365Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:24.9986833Z 
2025-05-07T20:32:24.9987266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.0172716Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:25.0174013Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:25.0175393Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:25.0176438Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:25.0177586Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:25.4352698Z 
2025-05-07T20:32:25.4353179Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4353684Z     self=,
2025-05-07T20:32:25.4354336Z     T=1,
2025-05-07T20:32:25.4354607Z     D=5120,
2025-05-07T20:32:25.4354885Z     scale_ub=1200.0,
2025-05-07T20:32:25.4355205Z     contiguous=True,
2025-05-07T20:32:25.4355479Z     compiled=True,
2025-05-07T20:32:25.4355701Z )
2025-05-07T20:32:25.4356029Z self = 
2025-05-07T20:32:25.4356549Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.4369030Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.4369204Z 
2025-05-07T20:32:25.4369306Z moe/activation_test.py:117: 
2025-05-07T20:32:25.4369612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.4369958Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.4370248Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.4370849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:25.4371425Z     return fn(*args, **kwargs)
2025-05-07T20:32:25.4372109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.4372818Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.4373378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:25.4374076Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.4374764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.4375318Z     kernel = self.compile(
2025-05-07T20:32:25.4375879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.4376562Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.4377065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.4377308Z 
2025-05-07T20:32:25.4377526Z self = 
2025-05-07T20:32:25.4378642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.4380063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a47060>}
2025-05-07T20:32:25.4381455Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.4382628Z context = 
2025-05-07T20:32:25.4382926Z 
2025-05-07T20:32:25.4383103Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.4383641Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.4384126Z                            module_map=module_map)
2025-05-07T20:32:25.4384503Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.4384864Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.4385129Z E       ^
2025-05-07T20:32:25.4385610Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.4386076Z 
2025-05-07T20:32:25.4386514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
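The recompile-limit warning above is a separate, second-order issue: each Hypothesis draw flips x0/x1 between contiguous and strided views, every stride change fails a dynamo guard, and after config.recompile_limit (8) recompiles dynamo stops compiling silu_mul_quant and falls back to eager. Two knobs that could reduce the churn, sketched under the assumption that the import path matches the traceback above (fbgemm_gpu.experimental.gen_ai.moe.activation):

    import torch
    import torch._dynamo

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: raise the cap so each distinct input layout still gets compiled.
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile with dynamic shapes so one graph may serve the varying
    # T/D/stride combinations that Hypothesis draws.
    op = torch.compile(silu_mul_quant, dynamic=True)

Neither knob addresses the fp8e4nv compilation error itself; on this GPU the kernel fails either way.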
2025-05-07T20:32:25.4387044Z 
2025-05-07T20:32:25.4387155Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4387588Z     self=,
2025-05-07T20:32:25.4388011Z     T=1,
2025-05-07T20:32:25.4388196Z     D=5120,
2025-05-07T20:32:25.4388394Z     scale_ub=None,
2025-05-07T20:32:25.4388617Z     contiguous=False,
2025-05-07T20:32:25.4388841Z     compiled=True,
2025-05-07T20:32:25.4389048Z )
2025-05-07T20:32:25.4389376Z self = 
2025-05-07T20:32:25.4389877Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:25.4404755Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.4404957Z 
2025-05-07T20:32:25.4405066Z moe/activation_test.py:126: 
2025-05-07T20:32:25.4429682Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.4430053Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.4430327Z E       ^
2025-05-07T20:32:25.4430814Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.4431284Z 
2025-05-07T20:32:25.4431738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.5844002Z 
2025-05-07T20:32:25.5844200Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.5844849Z     self=,
2025-05-07T20:32:25.5845466Z     T=1,
2025-05-07T20:32:25.5845737Z     D=5120,
2025-05-07T20:32:25.5845954Z     scale_ub=None,
2025-05-07T20:32:25.5846243Z     contiguous=True,
2025-05-07T20:32:25.5846568Z     compiled=False,
2025-05-07T20:32:25.5846862Z )
2025-05-07T20:32:25.5847261Z self = 
2025-05-07T20:32:25.5847808Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:25.5860258Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.5860432Z 
2025-05-07T20:32:25.5860549Z moe/activation_test.py:117: 
2025-05-07T20:32:25.5874649Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.5875024Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.5875291Z E       ^
2025-05-07T20:32:25.5875780Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.5876247Z 
2025-05-07T20:32:25.5876771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.5877306Z 
2025-05-07T20:32:25.5877420Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.5877847Z     self=,
2025-05-07T20:32:25.5878269Z     T=128,
2025-05-07T20:32:25.5878467Z     D=5120,
2025-05-07T20:32:25.5878662Z     scale_ub=None,
2025-05-07T20:32:25.5878888Z     contiguous=False,
2025-05-07T20:32:25.5879124Z     compiled=True,
2025-05-07T20:32:25.5879333Z )
2025-05-07T20:32:25.5879669Z self = 
2025-05-07T20:32:25.5880408Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:25.5899272Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.5899449Z 
2025-05-07T20:32:25.5899553Z moe/activation_test.py:117: 
2025-05-07T20:32:25.5915279Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.5915648Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.5915912Z E       ^
2025-05-07T20:32:25.5916398Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.5916865Z 
2025-05-07T20:32:25.5917305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.5917839Z 
2025-05-07T20:32:25.5917949Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.5918381Z     self=,
2025-05-07T20:32:25.5918802Z     T=128,
2025-05-07T20:32:25.5918997Z     D=7168,
2025-05-07T20:32:25.5919194Z     scale_ub=1200.0,
2025-05-07T20:32:25.5919431Z     contiguous=False,
2025-05-07T20:32:25.5919672Z     compiled=False,
2025-05-07T20:32:25.7462466Z )
2025-05-07T20:32:25.7462937Z self = 
2025-05-07T20:32:25.7463804Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:25.7476692Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.7476866Z 
2025-05-07T20:32:25.7476973Z moe/activation_test.py:117: 
2025-05-07T20:32:25.7491827Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.7492203Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.7492525Z E       ^
2025-05-07T20:32:25.7493018Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.7493484Z 
2025-05-07T20:32:25.7493924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.7494496Z 
2025-05-07T20:32:25.7494619Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.7495057Z     self=,
2025-05-07T20:32:25.7495474Z     T=128,
2025-05-07T20:32:25.7495675Z     D=5120,
2025-05-07T20:32:25.7495880Z     scale_ub=None,
2025-05-07T20:32:25.7496105Z     contiguous=False,
2025-05-07T20:32:25.7496343Z     compiled=False,
2025-05-07T20:32:25.7496565Z )
2025-05-07T20:32:25.7496903Z self = 
2025-05-07T20:32:25.7497415Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:25.7509725Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.7509903Z 
2025-05-07T20:32:25.7510008Z moe/activation_test.py:117: 
2025-05-07T20:32:25.7524525Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.7524894Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.7525173Z E       ^
2025-05-07T20:32:25.7525666Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.7526135Z 
2025-05-07T20:32:25.7526568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.7527107Z 
2025-05-07T20:32:25.7527219Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.7527658Z     self=,
2025-05-07T20:32:25.7528085Z     T=128,
2025-05-07T20:32:25.7528278Z     D=5120,
2025-05-07T20:32:25.7528485Z     scale_ub=1200.0,
2025-05-07T20:32:25.7528723Z     contiguous=True,
2025-05-07T20:32:25.7528955Z     compiled=False,
2025-05-07T20:32:25.7529174Z )
2025-05-07T20:32:25.7529513Z self = 
2025-05-07T20:32:25.7530028Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:25.7542406Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.7542585Z 
2025-05-07T20:32:25.7542690Z moe/activation_test.py:117: 
2025-05-07T20:32:25.7556846Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.7557214Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.7557483Z E       ^
2025-05-07T20:32:25.7557971Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.7558439Z 
2025-05-07T20:32:25.7558878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.9082748Z 
2025-05-07T20:32:25.9082992Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.9083459Z     self=,
2025-05-07T20:32:25.9083912Z     T=1,
2025-05-07T20:32:25.9084147Z     D=7168,
2025-05-07T20:32:25.9084353Z     scale_ub=1200.0,
2025-05-07T20:32:25.9084593Z     contiguous=True,
2025-05-07T20:32:25.9084826Z     compiled=True,
2025-05-07T20:32:25.9085039Z )
2025-05-07T20:32:25.9085380Z self = 
2025-05-07T20:32:25.9085886Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.9098219Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.9098399Z 
2025-05-07T20:32:25.9098504Z moe/activation_test.py:117: 
2025-05-07T20:32:25.9120672Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.9121043Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.9121310Z E       ^
2025-05-07T20:32:25.9121793Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.9122260Z 
2025-05-07T20:32:25.9122857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.9122260Z 2025-05-07T20:32:25.9122857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.9123511Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- identical test body; fails again at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80), with the same triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") -- /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
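The repeated ValueError comes from Triton's NVIDIA backend: fp8e4nv is Triton's name for the float8 E4M3 format, which the backend only accepts on newer GPUs (roughly compute capability 8.9, Ada/Hopper, and up), while the A10G behind a linux.g5.4xlarge runner reports sm_86 and therefore only exposes 'fp8e4b15' and 'fp8e5'. Every kernel touching the E4M3 dtype thus aborts in make_ir before any CUDA code is emitted. A minimal sketch of an up-front guard, assuming the sm_89 threshold and the helper name (neither appears in this log):

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (float8 E4M3) is assumed to
        # require compute capability >= 8.9; an A10G reports (8, 6).
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)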
2025-05-07T20:32:26.1347601Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -- identical test body, but here y_fp8, y_scale = fn() succeeds and the failure moves to the reference path: > y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), where ref_fn computes y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 and calls triton_quantize_fp8_row(y, scale_ub_tensor) (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370). The autotuner's benchmarking pass (triton/runtime/autotuner.py:186 run -> :166 _bench -> testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> make_ir) then hits the same error on the reference kernel: E triton.compiler.errors.CompilationError: at 1:0: E def _kernel_quantize_fp8_row( E ^ E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") -- /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
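For reference, the math ref_fn checks is just SiLU(x0) * x1 followed by row-wise float8 quantization. A compact sketch under assumed scaling details (the real triton_quantize_fp8_row may handle clamping and scale_ub differently):

    import torch

    def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor,
                           scale_ub: torch.Tensor | None = None):
        # SiLU(x0) * x1 in fp32, as in ref_fn above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Row-wise scale so each row fits into float8 E4M3 (max 448);
        # the clamp floor and scale_ub handling here are assumptions.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / torch.finfo(torch.float8_e4m3fn).max).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize via y_fp8.float() * scale[:, None]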
Each of the next eight Hypothesis examples runs the identical test body and fails the same way: y_fp8, y_scale = fn() (moe/activation_test.py:117) triggers compilation of _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80), which aborts in make_ir with triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100.
2025-05-07T20:32:26.1391957Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:26.2812652Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:26.2857201Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:26.2891290Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:26.4750310Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:26.4782575Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:26.6368388Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:26.6411277Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.6443324Z 2025-05-07T20:32:26.6443751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
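Every Hypothesis example in this run dies at the same point: Triton rejects the fp8e4nv dtype (OCP FP8 E4M3, torch.float8_e4m3fn) while lowering _fbgemm_silu_mul_quant in make_ir, before the kernel ever launches. On NVIDIA GPUs Triton typically enables fp8e4nv only from SM 8.9 (Ada) onward; the error text listing only ('fp8e4b15', 'fp8e5') is what Triton reports on older compute capabilities, so this runner's GPU evidently predates SM 8.9. A minimal probe, assuming that SM 8.9 threshold holds for the Triton build in use (supports_fp8e4nv is a hypothetical helper, not FBGEMM or Triton API):

```python
# Minimal sketch, assuming Triton's fp8e4nv (float8_e4m3fn) lowering requires
# NVIDIA SM 8.9 or newer. supports_fp8e4nv is a hypothetical helper.
import torch

def supports_fp8e4nv() -> bool:
    """Best-effort probe: does the current CUDA device support e4m3 fp8?"""
    if not torch.cuda.is_available():
        return False
    # Ada (8, 9) and Hopper (9, 0) onward expose fp8e4nv; Ampere (8, 6) does not.
    return torch.cuda.get_device_capability() >= (8, 9)
```

A test module could then gate the whole class with unittest.skipIf(not supports_fp8e4nv(), ...) rather than letting every drawn example fail inside the Triton compiler.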
Each subsequent example repeats the identical test body and traceback (moe/activation_test.py:117 -> moe/activation_test.py:115 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 -> triton/runtime/jit.py -> triton/compiler/compiler.py:100) and fails in make_ir with the same error; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.5181986Z 2025-05-07T20:32:27.5182424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.5182961Z 2025-05-07T20:32:27.5183070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.5183508Z self=, 2025-05-07T20:32:27.5183929Z T=2048, 2025-05-07T20:32:27.5184119Z D=5120, 2025-05-07T20:32:27.5184319Z scale_ub=None, 2025-05-07T20:32:27.5184545Z contiguous=False, 2025-05-07T20:32:27.5184776Z compiled=True, 2025-05-07T20:32:27.5184991Z ) 2025-05-07T20:32:27.5185326Z self = 2025-05-07T20:32:27.5185836Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:27.5186123Z 2025-05-07T20:32:27.5186204Z @given( 2025-05-07T20:32:27.5186442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.5186773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.5187093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.5187437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.5187782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.5188078Z ) 2025-05-07T20:32:27.5188444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.5188927Z def test_silu_mul_quant( 2025-05-07T20:32:27.5189178Z self, 2025-05-07T20:32:27.5189373Z T: int, 2025-05-07T20:32:27.5189575Z D: int, 2025-05-07T20:32:27.5189804Z scale_ub: Optional[float], 2025-05-07T20:32:27.5190086Z contiguous: bool, 2025-05-07T20:32:27.5190336Z compiled: bool, 2025-05-07T20:32:27.5190574Z ) -> None: 2025-05-07T20:32:27.5190798Z torch.manual_seed(2025) 2025-05-07T20:32:27.5191050Z 2025-05-07T20:32:27.5191335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.5191699Z 2025-05-07T20:32:27.5191895Z x_sign = torch.sign(x) 2025-05-07T20:32:27.5192200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.5192524Z x = x_sign * x_clamp 2025-05-07T20:32:27.5192771Z x0 = x[:, :D] 2025-05-07T20:32:27.5192997Z x1 = x[:, D:] 2025-05-07T20:32:27.5193234Z 2025-05-07T20:32:27.5193448Z if contiguous: 2025-05-07T20:32:27.5193689Z x0 = x0.contiguous() 2025-05-07T20:32:27.5193956Z x1 = x1.contiguous() 2025-05-07T20:32:27.5194202Z 2025-05-07T20:32:27.5194400Z if scale_ub is not None: 2025-05-07T20:32:27.5194690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.5195045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.5195371Z ) 2025-05-07T20:32:27.5195563Z else: 2025-05-07T20:32:27.5195779Z scale_ub_tensor = None 2025-05-07T20:32:27.5196038Z 2025-05-07T20:32:27.5196278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.5196710Z op = silu_mul_quant 2025-05-07T20:32:27.5196976Z if compiled: 2025-05-07T20:32:27.5197229Z op = torch.compile(op) 2025-05-07T20:32:27.5197536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.5197820Z 2025-05-07T20:32:27.5198014Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.5198192Z 2025-05-07T20:32:27.5198293Z moe/activation_test.py:117: 2025-05-07T20:32:27.5198600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.5198942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.5199282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.5199870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.5200549Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.5201231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.5209485Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.5210076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.5210801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.5211497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.5212058Z kernel = self.compile( 2025-05-07T20:32:27.5212628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.5213621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.5214047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.5214294Z 2025-05-07T20:32:27.5214513Z self = 2025-05-07T20:32:27.5215641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.5217070Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1c7c0>} 2025-05-07T20:32:27.5218465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.5219524Z context = 2025-05-07T20:32:27.5219829Z 2025-05-07T20:32:27.5220007Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.5220563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.5221052Z module_map=module_map) 2025-05-07T20:32:27.5221437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.5221809Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.5222082Z E ^ 2025-05-07T20:32:27.5222565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.5223036Z 2025-05-07T20:32:27.5223467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.8794056Z 2025-05-07T20:32:27.8794311Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.8794821Z self=, 2025-05-07T20:32:27.8795277Z T=2048, 2025-05-07T20:32:27.8795474Z D=5120, 2025-05-07T20:32:27.8795680Z scale_ub=1200.0, 2025-05-07T20:32:27.8795912Z contiguous=False, 2025-05-07T20:32:27.8796316Z compiled=True, 2025-05-07T20:32:27.8796535Z ) 2025-05-07T20:32:27.8796866Z self = 2025-05-07T20:32:27.8797383Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:27.8797667Z 2025-05-07T20:32:27.8797754Z @given( 2025-05-07T20:32:27.8797990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.8798316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.8798637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.8799039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.8799384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.8799687Z ) 2025-05-07T20:32:27.8800057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.8800589Z def test_silu_mul_quant( 2025-05-07T20:32:27.8800912Z self, 2025-05-07T20:32:27.8801116Z T: int, 2025-05-07T20:32:27.8801324Z D: int, 2025-05-07T20:32:27.8801551Z scale_ub: Optional[float], 2025-05-07T20:32:27.8801836Z contiguous: bool, 2025-05-07T20:32:27.8802084Z compiled: bool, 2025-05-07T20:32:27.8802319Z ) -> None: 2025-05-07T20:32:27.8802547Z torch.manual_seed(2025) 2025-05-07T20:32:27.8802796Z 2025-05-07T20:32:27.8803083Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.8803444Z 2025-05-07T20:32:27.8803645Z x_sign = torch.sign(x) 2025-05-07T20:32:27.8803950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.8804284Z x = x_sign * x_clamp 2025-05-07T20:32:27.8804534Z x0 = x[:, :D] 2025-05-07T20:32:27.8804755Z x1 = x[:, D:] 2025-05-07T20:32:27.8804974Z 2025-05-07T20:32:27.8805169Z if contiguous: 2025-05-07T20:32:27.8805405Z x0 = x0.contiguous() 2025-05-07T20:32:27.8805683Z x1 = x1.contiguous() 2025-05-07T20:32:27.8805938Z 2025-05-07T20:32:27.8806130Z if scale_ub is not None: 2025-05-07T20:32:27.8806420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.8806768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.8807087Z ) 2025-05-07T20:32:27.8807287Z else: 2025-05-07T20:32:27.8807506Z scale_ub_tensor = None 2025-05-07T20:32:27.8807762Z 2025-05-07T20:32:27.8808004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.8808331Z op = silu_mul_quant 2025-05-07T20:32:27.8808592Z if compiled: 2025-05-07T20:32:27.8808849Z op = torch.compile(op) 2025-05-07T20:32:27.8809163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8809443Z 2025-05-07T20:32:27.8809645Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.8809821Z 2025-05-07T20:32:27.8809923Z moe/activation_test.py:117: 2025-05-07T20:32:27.8810241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8810591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.8810887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8811478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.8812054Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.8812740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.8813628Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.8814193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.8814898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.8815592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.8816270Z kernel = self.compile( 2025-05-07T20:32:27.8816838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.8817519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.8817935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8818174Z 2025-05-07T20:32:27.8818396Z self = 2025-05-07T20:32:27.8819513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.8821004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1d580>} 2025-05-07T20:32:27.8822463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.8823570Z context = 2025-05-07T20:32:27.8823882Z 2025-05-07T20:32:27.8824062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.8824606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.8825099Z module_map=module_map) 2025-05-07T20:32:27.8825485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.8825850Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.8826124Z E ^ 2025-05-07T20:32:27.8826618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.8827091Z 2025-05-07T20:32:27.8827531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.8828062Z 2025-05-07T20:32:27.8828171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.8828601Z self=, 2025-05-07T20:32:27.8829022Z T=4096, 2025-05-07T20:32:27.8829214Z D=5120, 2025-05-07T20:32:27.8829415Z scale_ub=1200.0, 2025-05-07T20:32:27.8829647Z contiguous=True, 2025-05-07T20:32:27.8829872Z compiled=True, 2025-05-07T20:32:27.8830087Z ) 2025-05-07T20:32:27.8830422Z self = 2025-05-07T20:32:27.8830937Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.8831217Z 2025-05-07T20:32:27.8831296Z @given( 2025-05-07T20:32:27.8831533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.8831865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.8832183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.8832525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.8832870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.8833165Z ) 2025-05-07T20:32:27.8833533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.8834043Z def test_silu_mul_quant( 2025-05-07T20:32:27.8834294Z self, 2025-05-07T20:32:27.8834491Z T: int, 2025-05-07T20:32:27.8834697Z D: int, 2025-05-07T20:32:27.8834937Z scale_ub: Optional[float], 2025-05-07T20:32:27.8835219Z contiguous: bool, 2025-05-07T20:32:27.8835470Z compiled: bool, 2025-05-07T20:32:27.8835708Z ) -> None: 2025-05-07T20:32:27.8835930Z torch.manual_seed(2025) 2025-05-07T20:32:27.8836183Z 2025-05-07T20:32:27.8836472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.8836905Z 2025-05-07T20:32:27.8837113Z x_sign = torch.sign(x) 2025-05-07T20:32:27.8837419Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.8837740Z x = x_sign * x_clamp 2025-05-07T20:32:27.8837994Z x0 = x[:, :D] 2025-05-07T20:32:27.8838220Z x1 = x[:, D:] 2025-05-07T20:32:27.8838435Z 2025-05-07T20:32:27.8838631Z if contiguous: 2025-05-07T20:32:27.8838874Z x0 = x0.contiguous() 2025-05-07T20:32:27.8839142Z x1 = x1.contiguous() 2025-05-07T20:32:27.8839390Z 2025-05-07T20:32:27.8839640Z if scale_ub is not None: 2025-05-07T20:32:27.8839919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.8840343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.8840664Z ) 2025-05-07T20:32:27.8840863Z else: 2025-05-07T20:32:27.8841080Z scale_ub_tensor = None 2025-05-07T20:32:27.8841391Z 2025-05-07T20:32:27.8841637Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.8841971Z op = silu_mul_quant 2025-05-07T20:32:27.8842230Z if compiled: 2025-05-07T20:32:27.8842483Z op = torch.compile(op) 2025-05-07T20:32:27.8842793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8843081Z 2025-05-07T20:32:27.8843277Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.8843453Z 2025-05-07T20:32:27.8843555Z moe/activation_test.py:117: 2025-05-07T20:32:27.8843861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8844209Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.8844501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8845081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.8845661Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.8846346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.8847058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.8847617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.8848326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.8849012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.8849566Z kernel = self.compile( 2025-05-07T20:32:27.8850138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.8850818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.8851228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8851475Z 2025-05-07T20:32:27.8851693Z self = 2025-05-07T20:32:27.8852819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.8854251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1e840>} 2025-05-07T20:32:27.8855642Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.8856708Z context = 2025-05-07T20:32:27.8857013Z 2025-05-07T20:32:27.8857184Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.8857873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.8858358Z module_map=module_map) 2025-05-07T20:32:27.8858740Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.8859110Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.8859375Z E ^ 2025-05-07T20:32:27.8859860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.8860333Z 2025-05-07T20:32:27.8860767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0540761Z 2025-05-07T20:32:28.0541095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0541553Z self=, 2025-05-07T20:32:28.0542049Z T=128, 2025-05-07T20:32:28.0542421Z D=5120, 2025-05-07T20:32:28.0542626Z scale_ub=1200.0, 2025-05-07T20:32:28.0542871Z contiguous=False, 2025-05-07T20:32:28.0543105Z compiled=True, 2025-05-07T20:32:28.0543321Z ) 2025-05-07T20:32:28.0543685Z self = 2025-05-07T20:32:28.0544230Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.0544513Z 2025-05-07T20:32:28.0544596Z @given( 2025-05-07T20:32:28.0544837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0545165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0545487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0545839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0546185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0546480Z ) 2025-05-07T20:32:28.0546847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0547319Z def test_silu_mul_quant( 2025-05-07T20:32:28.0547571Z self, 2025-05-07T20:32:28.0547770Z T: int, 2025-05-07T20:32:28.0547978Z D: int, 2025-05-07T20:32:28.0548208Z scale_ub: Optional[float], 2025-05-07T20:32:28.0548487Z contiguous: bool, 2025-05-07T20:32:28.0548744Z compiled: bool, 2025-05-07T20:32:28.0548978Z ) -> None: 2025-05-07T20:32:28.0549197Z torch.manual_seed(2025) 2025-05-07T20:32:28.0549450Z 2025-05-07T20:32:28.0549735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0550090Z 2025-05-07T20:32:28.0550296Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0550603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0550927Z x = x_sign * x_clamp 2025-05-07T20:32:28.0551186Z x0 = x[:, :D] 2025-05-07T20:32:28.0551415Z x1 = x[:, D:] 2025-05-07T20:32:28.0551632Z 2025-05-07T20:32:28.0551835Z if contiguous: 2025-05-07T20:32:28.0552080Z x0 = x0.contiguous() 2025-05-07T20:32:28.0552352Z x1 = x1.contiguous() 2025-05-07T20:32:28.0552607Z 2025-05-07T20:32:28.0552811Z if scale_ub is not None: 2025-05-07T20:32:28.0553095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0553444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0553764Z ) 2025-05-07T20:32:28.0553963Z else: 2025-05-07T20:32:28.0554174Z scale_ub_tensor = None 2025-05-07T20:32:28.0554434Z 2025-05-07T20:32:28.0554672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0554997Z op = silu_mul_quant 2025-05-07T20:32:28.0555253Z if compiled: 2025-05-07T20:32:28.0555510Z op = torch.compile(op) 2025-05-07T20:32:28.0555814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0556101Z 2025-05-07T20:32:28.0556301Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.0556475Z 2025-05-07T20:32:28.0556577Z moe/activation_test.py:117: 2025-05-07T20:32:28.0557005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0557356Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.0557650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0558225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.0558806Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.0559487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.0560349Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.0560909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0561623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0562368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0562915Z kernel = self.compile( 2025-05-07T20:32:28.0563482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0564164Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0564580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0564818Z 2025-05-07T20:32:28.0565034Z self = 2025-05-07T20:32:28.0566161Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0567590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1f4c0>} 2025-05-07T20:32:28.0568986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0570041Z context = 2025-05-07T20:32:28.0570346Z 2025-05-07T20:32:28.0570523Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0571070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0571565Z module_map=module_map) 2025-05-07T20:32:28.0571943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0572312Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.0572585Z E ^ 2025-05-07T20:32:28.0573073Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0573545Z 2025-05-07T20:32:28.0573978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0574514Z 2025-05-07T20:32:28.0574623Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0575056Z self=, 2025-05-07T20:32:28.0575470Z T=16384, 2025-05-07T20:32:28.0575674Z D=7168, 2025-05-07T20:32:28.0575876Z scale_ub=1200.0, 2025-05-07T20:32:28.0576104Z contiguous=True, 2025-05-07T20:32:28.0576336Z compiled=True, 2025-05-07T20:32:28.0576548Z ) 2025-05-07T20:32:28.0576877Z self = 2025-05-07T20:32:28.0577400Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.0577695Z 2025-05-07T20:32:28.0577777Z @given( 2025-05-07T20:32:28.0578103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0578430Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0578749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0579096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0579436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0579736Z ) 2025-05-07T20:32:28.0580101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0580558Z def test_silu_mul_quant( 2025-05-07T20:32:28.0580812Z self, 2025-05-07T20:32:28.0581062Z T: int, 2025-05-07T20:32:28.0581265Z D: int, 2025-05-07T20:32:28.0581495Z scale_ub: Optional[float], 2025-05-07T20:32:28.0581782Z contiguous: bool, 2025-05-07T20:32:28.0582035Z compiled: bool, 2025-05-07T20:32:28.0582264Z ) -> None: 2025-05-07T20:32:28.0582493Z torch.manual_seed(2025) 2025-05-07T20:32:28.0582791Z 2025-05-07T20:32:28.0583082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0583439Z 2025-05-07T20:32:28.0583644Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0583943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0584265Z x = x_sign * x_clamp 2025-05-07T20:32:28.0584517Z x0 = x[:, :D] 2025-05-07T20:32:28.0584736Z x1 = x[:, D:] 2025-05-07T20:32:28.0584952Z 2025-05-07T20:32:28.0585143Z if contiguous: 2025-05-07T20:32:28.0585378Z x0 = x0.contiguous() 2025-05-07T20:32:28.0585645Z x1 = x1.contiguous() 2025-05-07T20:32:28.0585899Z 2025-05-07T20:32:28.0586090Z if scale_ub is not None: 2025-05-07T20:32:28.0586377Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0586724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0587044Z ) 2025-05-07T20:32:28.0587243Z else: 2025-05-07T20:32:28.0587459Z scale_ub_tensor = None 2025-05-07T20:32:28.0587723Z 2025-05-07T20:32:28.0587962Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0588289Z op = silu_mul_quant 2025-05-07T20:32:28.0588552Z if compiled: 2025-05-07T20:32:28.0588808Z op = torch.compile(op) 2025-05-07T20:32:28.0589112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0589396Z 2025-05-07T20:32:28.0589596Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.0589765Z 2025-05-07T20:32:28.0589867Z moe/activation_test.py:117: 2025-05-07T20:32:28.0590180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0590533Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.0590824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0591406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.0591989Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.0592678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.0593391Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.0593951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0594663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0595351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0595907Z kernel = self.compile( 2025-05-07T20:32:28.0596471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0597152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0597564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0597808Z 2025-05-07T20:32:28.0598105Z self = 2025-05-07T20:32:28.0599225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0600737Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e34c20>} 2025-05-07T20:32:28.0602170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0603238Z context = 2025-05-07T20:32:28.0603580Z 2025-05-07T20:32:28.0603759Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0604308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0604790Z module_map=module_map) 2025-05-07T20:32:28.0605167Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0605529Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.0605802Z E ^ 2025-05-07T20:32:28.0606287Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0606756Z 2025-05-07T20:32:28.0607193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.1756033Z 2025-05-07T20:32:28.1756472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.1757141Z self=, 2025-05-07T20:32:28.1768302Z T=16384, 2025-05-07T20:32:28.1768524Z D=5120, 2025-05-07T20:32:28.1768730Z scale_ub=1200.0, 2025-05-07T20:32:28.1768957Z contiguous=True, 2025-05-07T20:32:28.1769190Z compiled=False, 2025-05-07T20:32:28.1769407Z ) 2025-05-07T20:32:28.1769738Z self = 2025-05-07T20:32:28.1770252Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.1770551Z 2025-05-07T20:32:28.1770630Z @given( 2025-05-07T20:32:28.1770869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.1771190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.1771507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.1771848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.1772181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.1772476Z ) 2025-05-07T20:32:28.1772842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.1773306Z def test_silu_mul_quant( 2025-05-07T20:32:28.1773549Z self, 2025-05-07T20:32:28.1773750Z T: int, 2025-05-07T20:32:28.1774081Z D: int, 2025-05-07T20:32:28.1774303Z scale_ub: Optional[float], 2025-05-07T20:32:28.1774577Z contiguous: bool, 2025-05-07T20:32:28.1774867Z compiled: bool, 2025-05-07T20:32:28.1775101Z ) -> None: 2025-05-07T20:32:28.1775321Z torch.manual_seed(2025) 2025-05-07T20:32:28.1775565Z 2025-05-07T20:32:28.1775844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.1776201Z 2025-05-07T20:32:28.1776396Z x_sign = torch.sign(x) 2025-05-07T20:32:28.1776696Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.1777015Z x = x_sign * x_clamp 2025-05-07T20:32:28.1777256Z x0 = x[:, :D] 2025-05-07T20:32:28.1777490Z x1 = x[:, D:] 2025-05-07T20:32:28.1777719Z 2025-05-07T20:32:28.1778199Z if contiguous: 2025-05-07T20:32:28.1778461Z x0 = x0.contiguous() 2025-05-07T20:32:28.1778727Z x1 = x1.contiguous() 2025-05-07T20:32:28.1778969Z 2025-05-07T20:32:28.1779166Z if scale_ub is not None: 2025-05-07T20:32:28.1779445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.1779785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.1780101Z ) 2025-05-07T20:32:28.1780303Z else: 2025-05-07T20:32:28.1780512Z scale_ub_tensor = None 2025-05-07T20:32:28.1780904Z 2025-05-07T20:32:28.1781214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.1781540Z op = silu_mul_quant 2025-05-07T20:32:28.1781791Z if compiled: 2025-05-07T20:32:28.1782047Z op = torch.compile(op) 2025-05-07T20:32:28.1782352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1782701Z 2025-05-07T20:32:28.1782901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.1783078Z 2025-05-07T20:32:28.1783186Z moe/activation_test.py:117: 2025-05-07T20:32:28.1783513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1783880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.1784167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1784889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:28.1785599Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.1786160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.1786870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.1787558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.1788111Z kernel = self.compile( 2025-05-07T20:32:28.1788679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.1789361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.1789774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1790017Z 2025-05-07T20:32:28.1790228Z self = 2025-05-07T20:32:28.1791350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.1792780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e35580>} 2025-05-07T20:32:28.1794223Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.1795278Z context = 2025-05-07T20:32:28.1795582Z 2025-05-07T20:32:28.1795754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.1796298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.1796779Z module_map=module_map) 2025-05-07T20:32:28.1797159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.1797524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.1797791Z E ^ 2025-05-07T20:32:28.1798266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.1798738Z 2025-05-07T20:32:28.1799317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.1799853Z 2025-05-07T20:32:28.1799969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.1800528Z self=, 2025-05-07T20:32:28.1800945Z T=1, 2025-05-07T20:32:28.1801129Z D=7168, 2025-05-07T20:32:28.1801325Z scale_ub=1200.0, 2025-05-07T20:32:28.1801553Z contiguous=False, 2025-05-07T20:32:28.1801785Z compiled=False, 2025-05-07T20:32:28.1801991Z ) 2025-05-07T20:32:28.1802312Z self = 2025-05-07T20:32:28.1802865Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.1803140Z 2025-05-07T20:32:28.1803224Z @given( 2025-05-07T20:32:28.1803456Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.1803827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.1804151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.1804489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.1804830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.1805128Z ) 2025-05-07T20:32:28.1805488Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.1805944Z def test_silu_mul_quant( 2025-05-07T20:32:28.1806193Z self, 2025-05-07T20:32:28.1806394Z T: int, 2025-05-07T20:32:28.1806594Z D: int, 2025-05-07T20:32:28.1806815Z scale_ub: Optional[float], 2025-05-07T20:32:28.1807099Z contiguous: bool, 2025-05-07T20:32:28.1807338Z compiled: bool, 2025-05-07T20:32:28.1807569Z ) -> None: 2025-05-07T20:32:28.1807790Z torch.manual_seed(2025) 2025-05-07T20:32:28.1808034Z 2025-05-07T20:32:28.1808319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.1808677Z 2025-05-07T20:32:28.1808879Z x_sign = torch.sign(x) 2025-05-07T20:32:28.1809180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.1809500Z x = x_sign * x_clamp 2025-05-07T20:32:28.1809742Z x0 = x[:, :D] 2025-05-07T20:32:28.1809962Z x1 = x[:, D:] 2025-05-07T20:32:28.1810180Z 2025-05-07T20:32:28.1810367Z if contiguous: 2025-05-07T20:32:28.1810605Z x0 = x0.contiguous() 2025-05-07T20:32:28.1810882Z x1 = x1.contiguous() 2025-05-07T20:32:28.1811131Z 2025-05-07T20:32:28.1811322Z if scale_ub is not None: 2025-05-07T20:32:28.1811608Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.1811959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.1812280Z ) 2025-05-07T20:32:28.1812481Z else: 2025-05-07T20:32:28.1812698Z scale_ub_tensor = None 2025-05-07T20:32:28.1812950Z 2025-05-07T20:32:28.1813195Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.1813829Z op = silu_mul_quant 2025-05-07T20:32:28.1814083Z if compiled: 2025-05-07T20:32:28.1814338Z op = torch.compile(op) 2025-05-07T20:32:28.1814643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1814918Z 2025-05-07T20:32:28.1815113Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.1815286Z 2025-05-07T20:32:28.1815388Z moe/activation_test.py:117: 2025-05-07T20:32:28.1815692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1816030Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.1816325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1817045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.1817752Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.1818505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.1819226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.1819915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.1820465Z kernel = self.compile( 2025-05-07T20:32:28.1821032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.1821716Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.1822177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1822421Z 2025-05-07T20:32:28.1822635Z self = 2025-05-07T20:32:28.1823759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.1825243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e368e0>} 2025-05-07T20:32:28.1826626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.1827674Z context = 2025-05-07T20:32:28.1827980Z 2025-05-07T20:32:28.1828154Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.1828696Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.1829181Z module_map=module_map) 2025-05-07T20:32:28.1829560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.1829933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.1830219Z E ^ 2025-05-07T20:32:28.1830700Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.1831165Z 2025-05-07T20:32:28.1831592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.1832124Z 2025-05-07T20:32:28.1832234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.1832676Z self=, 2025-05-07T20:32:28.1833091Z T=4096, 2025-05-07T20:32:28.1833283Z D=7168, 2025-05-07T20:32:28.1833482Z scale_ub=1200.0, 2025-05-07T20:32:28.1833712Z contiguous=False, 2025-05-07T20:32:28.1833942Z compiled=True, 2025-05-07T20:32:28.3434543Z ) 2025-05-07T20:32:28.3435931Z self = 2025-05-07T20:32:28.3437092Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.3437679Z 2025-05-07T20:32:28.3437847Z @given( 2025-05-07T20:32:28.3438344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.3439011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.3439662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.3440482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.3441181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.3441783Z ) 2025-05-07T20:32:28.3442524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.3443464Z def test_silu_mul_quant( 2025-05-07T20:32:28.3443970Z self, 2025-05-07T20:32:28.3444196Z T: int, 2025-05-07T20:32:28.3444434Z D: int, 2025-05-07T20:32:28.3444668Z scale_ub: Optional[float], 2025-05-07T20:32:28.3444958Z contiguous: bool, 2025-05-07T20:32:28.3445603Z compiled: bool, 2025-05-07T20:32:28.3445860Z ) -> None: 2025-05-07T20:32:28.3446087Z torch.manual_seed(2025) 2025-05-07T20:32:28.3446351Z 2025-05-07T20:32:28.3446648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.3447019Z 2025-05-07T20:32:28.3447223Z x_sign = torch.sign(x) 2025-05-07T20:32:28.3447540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.3447874Z x = x_sign * x_clamp 2025-05-07T20:32:28.3448128Z x0 = x[:, :D] 2025-05-07T20:32:28.3448456Z x1 = x[:, D:] 2025-05-07T20:32:28.3448683Z 2025-05-07T20:32:28.3448876Z if contiguous: 2025-05-07T20:32:28.3449127Z x0 = x0.contiguous() 2025-05-07T20:32:28.3449408Z x1 = x1.contiguous() 2025-05-07T20:32:28.3449664Z 2025-05-07T20:32:28.3449873Z if scale_ub is not None: 2025-05-07T20:32:28.3450258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.3450625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.3450964Z ) 2025-05-07T20:32:28.3451207Z else: 2025-05-07T20:32:28.3451445Z scale_ub_tensor = None 2025-05-07T20:32:28.3451732Z 2025-05-07T20:32:28.3451978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.3452324Z op = silu_mul_quant 2025-05-07T20:32:28.3452596Z if compiled: 2025-05-07T20:32:28.3452861Z op = torch.compile(op) 2025-05-07T20:32:28.3453186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3453492Z 2025-05-07T20:32:28.3453695Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.3453882Z 2025-05-07T20:32:28.3453991Z moe/activation_test.py:117: 2025-05-07T20:32:28.3454314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3454677Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.3454979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3455586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.3456185Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.3456879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.3457606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.3458179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.3458901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.3459592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.3460154Z kernel = self.compile( 2025-05-07T20:32:28.3460728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.3461429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.3461847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3462097Z 2025-05-07T20:32:28.3462315Z self = 2025-05-07T20:32:28.3463457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.3464913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e37a60>} 2025-05-07T20:32:28.3466415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.3467498Z context = 2025-05-07T20:32:28.3467814Z 2025-05-07T20:32:28.3467992Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.3468553Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.3469045Z module_map=module_map) 2025-05-07T20:32:28.3469437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.3469818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.3470137Z E ^ 2025-05-07T20:32:28.3470634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.3471112Z 2025-05-07T20:32:28.3471552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.3472134Z 2025-05-07T20:32:28.3472257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.3472695Z self=, 2025-05-07T20:32:28.3473128Z T=128, 2025-05-07T20:32:28.3473332Z D=7168, 2025-05-07T20:32:28.3473540Z scale_ub=1200.0, 2025-05-07T20:32:28.3473813Z contiguous=False, 2025-05-07T20:32:28.3474078Z compiled=True, 2025-05-07T20:32:28.3474296Z ) 2025-05-07T20:32:28.3474640Z self = 2025-05-07T20:32:28.3475169Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.3475458Z 2025-05-07T20:32:28.3475550Z @given( 2025-05-07T20:32:28.3475791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.3476130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.3476459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.3476810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.3477168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.3477480Z ) 2025-05-07T20:32:28.3477849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.3478324Z def test_silu_mul_quant( 2025-05-07T20:32:28.3478586Z self, 2025-05-07T20:32:28.3478803Z T: int, 2025-05-07T20:32:28.3479005Z D: int, 2025-05-07T20:32:28.3479246Z scale_ub: Optional[float], 2025-05-07T20:32:28.3479542Z contiguous: bool, 2025-05-07T20:32:28.3479796Z compiled: bool, 2025-05-07T20:32:28.3480042Z ) -> None: 2025-05-07T20:32:28.3480408Z torch.manual_seed(2025) 2025-05-07T20:32:28.3480660Z 2025-05-07T20:32:28.3480954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.3481319Z 2025-05-07T20:32:28.3481519Z x_sign = torch.sign(x) 2025-05-07T20:32:28.3481834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.3482170Z x = x_sign * x_clamp 2025-05-07T20:32:28.3482420Z x0 = x[:, :D] 2025-05-07T20:32:28.3482653Z x1 = x[:, D:] 2025-05-07T20:32:28.3482880Z 2025-05-07T20:32:28.3483072Z if contiguous: 2025-05-07T20:32:28.3483320Z x0 = x0.contiguous() 2025-05-07T20:32:28.3483606Z x1 = x1.contiguous() 2025-05-07T20:32:28.3483857Z 2025-05-07T20:32:28.3484077Z if scale_ub is not None: 2025-05-07T20:32:28.3484417Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.3484777Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.3485105Z ) 2025-05-07T20:32:28.3485315Z else: 2025-05-07T20:32:28.3485545Z scale_ub_tensor = None 2025-05-07T20:32:28.3485806Z 2025-05-07T20:32:28.3486054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.3486390Z op = silu_mul_quant 2025-05-07T20:32:28.3486653Z if compiled: 2025-05-07T20:32:28.3487012Z op = torch.compile(op) 2025-05-07T20:32:28.3487334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3487625Z 2025-05-07T20:32:28.3487836Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.3488009Z 2025-05-07T20:32:28.3488124Z moe/activation_test.py:117: 2025-05-07T20:32:28.3488434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3488790Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.3489092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3489684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.3490313Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.3491010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.3491809Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.3492380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.3493296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.3494004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.3494626Z kernel = self.compile( 2025-05-07T20:32:28.3495193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.3495889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.3496321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3496566Z 2025-05-07T20:32:28.3496792Z self = 2025-05-07T20:32:28.3497930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.3499372Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f817cea0>} 2025-05-07T20:32:28.3500782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.3501859Z context = 2025-05-07T20:32:28.3502162Z 2025-05-07T20:32:28.3502352Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.3502905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.3503436Z module_map=module_map) 2025-05-07T20:32:28.3503858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.3504229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.3504510Z E ^ 2025-05-07T20:32:28.3505006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.3505478Z 2025-05-07T20:32:28.3505923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.3506459Z 2025-05-07T20:32:28.3506572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.3507021Z self=, 2025-05-07T20:32:28.3507452Z T=2048, 2025-05-07T20:32:28.3507663Z D=7168, 2025-05-07T20:32:28.3507866Z scale_ub=None, 2025-05-07T20:32:28.3508097Z contiguous=True, 2025-05-07T20:32:28.3508340Z compiled=True, 2025-05-07T20:32:28.4771378Z ) 2025-05-07T20:32:28.4772133Z self = 2025-05-07T20:32:28.4772677Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.4772961Z 2025-05-07T20:32:28.4773056Z @given( 2025-05-07T20:32:28.4773298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.4773637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.4773969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.4774317Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.4774673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.4775104Z ) 2025-05-07T20:32:28.4775483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.4775951Z def test_silu_mul_quant( 2025-05-07T20:32:28.4776211Z self, 2025-05-07T20:32:28.4776423Z T: int, 2025-05-07T20:32:28.4776714Z D: int, 2025-05-07T20:32:28.4776951Z scale_ub: Optional[float], 2025-05-07T20:32:28.4777249Z contiguous: bool, 2025-05-07T20:32:28.4777500Z compiled: bool, 2025-05-07T20:32:28.4777744Z ) -> None: 2025-05-07T20:32:28.4777978Z torch.manual_seed(2025) 2025-05-07T20:32:28.4778232Z 2025-05-07T20:32:28.4778574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.4778928Z 2025-05-07T20:32:28.4779141Z x_sign = torch.sign(x) 2025-05-07T20:32:28.4779456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.4779783Z x = x_sign * x_clamp 2025-05-07T20:32:28.4780046Z x0 = x[:, :D] 2025-05-07T20:32:28.4780300Z x1 = x[:, D:] 2025-05-07T20:32:28.4780524Z 2025-05-07T20:32:28.4780719Z if contiguous: 2025-05-07T20:32:28.4780972Z x0 = x0.contiguous() 2025-05-07T20:32:28.4781251Z x1 = x1.contiguous() 2025-05-07T20:32:28.4781503Z 2025-05-07T20:32:28.4781715Z if scale_ub is not None: 2025-05-07T20:32:28.4782012Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.4782367Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.4782702Z ) 2025-05-07T20:32:28.4782915Z else: 2025-05-07T20:32:28.4783148Z scale_ub_tensor = None 2025-05-07T20:32:28.4783417Z 2025-05-07T20:32:28.4783671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.4784013Z op = silu_mul_quant 2025-05-07T20:32:28.4784277Z if compiled: 2025-05-07T20:32:28.4784547Z op = torch.compile(op) 2025-05-07T20:32:28.4795012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.4795326Z 2025-05-07T20:32:28.4795533Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.4795720Z 2025-05-07T20:32:28.4795831Z moe/activation_test.py:117: 2025-05-07T20:32:28.4796158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.4796519Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.4796835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.4797447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.4798051Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.4798748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:28.4799484Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:28.4800064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:28.4800897Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:28.4801607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:28.4802176Z     kernel = self.compile(
2025-05-07T20:32:28.4802896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:28.4803590Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:28.4811603Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:28.4811986Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:28.4812259Z E   ^
2025-05-07T20:32:28.4812754Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.4813227Z 
2025-05-07T20:32:28.4814115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
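Note that compiled=False examples later in the run fail with this same CompilationError: silu_mul_quant launches the Triton kernel directly (activation.py:80), so the torch.compile wrapper is not what pulls in fp8e4nv. For orientation, a plain eager sketch of what the fused op appears to compute; the row-wise scaling and the 448.0 e4m3 ceiling below are assumptions, since the actual quantization scheme of _fbgemm_silu_mul_quant is not visible in this log:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # Gate: silu(x0) * x1, computed in float32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Assumed row-wise scale into the float8_e4m3fn range (max 448).
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / 448.0
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

    # Usage mirroring the test: y_fp8, y_scale = silu_mul_quant_ref(x0, x1, scale_ub_tensor)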
2025-05-07T20:32:28.4814660Z 
2025-05-07T20:32:28.4814775Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:28.4815216Z     self=,
2025-05-07T20:32:28.4815643Z     T=16384,
2025-05-07T20:32:28.4815849Z     D=5120,
2025-05-07T20:32:28.4816047Z     scale_ub=None,
2025-05-07T20:32:28.4816281Z     contiguous=False,
2025-05-07T20:32:28.4816524Z     compiled=False,
2025-05-07T20:32:28.4816736Z )
2025-05-07T20:32:28.4817074Z self = 
2025-05-07T20:32:28.4817606Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:28.4817901Z 
2025-05-07T20:32:28.4823214Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.4823833Z         x_sign = torch.sign(x)
2025-05-07T20:32:28.4824332Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.4826444Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:28.4828436Z 
2025-05-07T20:32:28.4828571Z moe/activation_test.py:95: OutOfMemoryError
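Every OOM report that follows carries the same allocator hint as the message above. For it to take effect, PYTORCH_CUDA_ALLOC_CONF has to be in the environment before the process makes its first CUDA allocation, so it belongs in the CI step or a conftest rather than inside the test body; a sketch, with the placement illustrative:

    import os

    # Must be set before the first CUDA allocation in the process;
    # safest is before torch is imported at all.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after the env var on purpose)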
2025-05-07T20:32:28.4828902Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:28.4838263Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.4840437Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.4842508Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:28.4842838Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:28.4851440Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.4853558Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.4855627Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.6078365Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:28.6088397Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.6090498Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.6092677Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:28.6093016Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.6102043Z >       x_sign = torch.sign(x)
2025-05-07T20:32:28.6104058Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.6106102Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:28.6106439Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.6122073Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.6136698Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6138749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.6139403Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.6154500Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.6168871Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6170920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.7298575Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.7324749Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.7339431Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.7341506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.7342156Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.7350714Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.7352861Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.7354995Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.7355400Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.7370835Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.7385242Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.7387314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.8194581Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8204163Z >       x_sign = torch.sign(x)
2025-05-07T20:32:28.8206193Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8208252Z moe/activation_test.py:94: OutOfMemoryError
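The requested sizes in these OOM reports match the first allocation each example makes, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16), at 2 bytes per bfloat16 element; the examples themselves are small, and the failures come from the roughly 22 GiB already held by earlier work in the process. The arithmetic, as a quick check:

    def first_alloc_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) -> 2 bytes per element
        return T * (2 * D) * 2 / 2**20

    assert first_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert first_alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert first_alloc_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"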
2025-05-07T20:32:28.8208585Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8217264Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8219558Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8221629Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8221963Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8230357Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8232480Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8234582Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8234925Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:28.8243460Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8245584Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8247675Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8248011Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:28.8256292Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8258419Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8260470Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8260889Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:28.8816696Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8818815Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8831247Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8831600Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.8840098Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8842309Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8844510Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8844847Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:28.8853053Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8855180Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8857227Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8857562Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8865951Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8868072Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8870119Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8870443Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8878720Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8881006Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.8882971Z 2025-05-07T20:32:28.8883092Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:28.8883317Z 2025-05-07T20:32:28.8883423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8883854Z self=, 2025-05-07T20:32:28.8884316Z T=16384, 2025-05-07T20:32:28.8884509Z D=7168, 2025-05-07T20:32:28.8884715Z scale_ub=1200.0, 2025-05-07T20:32:28.8884945Z contiguous=True, 2025-05-07T20:32:28.8885175Z compiled=False, 2025-05-07T20:32:28.8885385Z ) 2025-05-07T20:32:28.8885718Z self = 2025-05-07T20:32:28.8886230Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.8886528Z 2025-05-07T20:32:28.8886607Z @given( 2025-05-07T20:32:28.8886846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8887169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8887483Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8887823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8888166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8888459Z ) 2025-05-07T20:32:28.8888831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8889299Z def test_silu_mul_quant( 2025-05-07T20:32:28.8889551Z self, 2025-05-07T20:32:28.8889755Z T: int, 2025-05-07T20:32:28.8889962Z D: int, 2025-05-07T20:32:28.8890183Z scale_ub: Optional[float], 2025-05-07T20:32:28.8890469Z contiguous: bool, 2025-05-07T20:32:28.8890722Z compiled: bool, 2025-05-07T20:32:28.8890946Z ) -> None: 2025-05-07T20:32:28.8891174Z torch.manual_seed(2025) 2025-05-07T20:32:28.8891425Z 2025-05-07T20:32:28.8891699Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8893877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
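The allocator hint repeated in these messages can be tried before re-running the suite. A minimal sketch, assuming only the standard PyTorch caching-allocator environment variable named in the error text (the tensor shape is illustrative):

    import os

    # Must be set before CUDA is first initialized in this process; expandable
    # segments let the caching allocator grow existing blocks instead of
    # fragmenting into many fixed-size ones.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402  (imported after the env var is set)

    x = torch.randn(1024, device="cuda")  # allocations now use expandable segments

Calling torch.cuda.empty_cache() between Hypothesis examples is another mitigation, though it only returns cached blocks to the driver and cannot reclaim memory still referenced by live tensors.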
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
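The CompilationError above is an architecture limitation rather than a test-logic bug: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this runner's GPU and offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard one could place in front of such tests; the (8, 9) threshold (Ada/Hopper-class) is an assumption inferred from this error message, not something the log states, and the class and function names are illustrative:

    import unittest

    import torch


    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv requires compute capability >= (8, 9);
        # the g5 runner's A10G reports (8, 6) and fails to compile the kernel.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 unsupported on this GPU")
    class ActivationFP8Tests(unittest.TestCase):
        ...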
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

The remaining examples in this pass failed with the same out-of-memory condition before reaching any kernel; by this point only 4.44 MiB of the 22.07 GiB on GPU 0 was free:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:92)

See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
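For the compiled=True examples the only new frame is torch._dynamo's eval_frame wrapper: torch.compile(op) intercepts the call and then reaches the same Triton kernel, so both paths surface the identical fp8e4nv error. A hedged sketch of that wrapping, using the reference math from the test's own ref_fn; the function name silu_mul_ref is illustrative, not from the log:

    import torch


    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Reference math from the test's ref_fn: SiLU(x0) * x1, computed in fp32.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32


    # compiled=True in the test corresponds to wrapping the op like this; the
    # call is routed through torch._dynamo.eval_frame before execution.
    compiled_op = torch.compile(silu_mul_ref)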
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |        ~~~~~~^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
  +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |     a,
    |     ^^
    |     ...<23 lines>...
    |     USE_INT64=use_int64,
    |     ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |     ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |         *args,
    |         **current,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
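Each falsifying example above comes with a reproduce_failure payload, and applying it is exactly what the message suggests. A minimal sketch, assuming only the public Hypothesis decorator; the blob shown is the one printed for sub-exception 1, _MAX_SAMPLES is replaced by a literal here, and the test body is elided:

    import unittest
    from typing import Optional

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st


    class ActivationTests(unittest.TestCase):
        # Temporarily pins Hypothesis to the falsifying example from
        # sub-exception 1; remove the decorator once the failure is fixed.
        @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
        def test_silu_mul_quant(
            self,
            T: int,
            D: int,
            scale_ub: Optional[float],
            contiguous: bool,
            compiled: bool,
        ) -> None:
            ...  # body as in moe/activation_test.py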
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

Here fn() returned successfully and the failure moved into the reference path, ref_fn:

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

The remaining examples in the log fail in the same two ways, in fn() when the FBGEMM kernel compiles eagerly and in ref_fn() when the reference quantization kernel compiles:

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
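Every example reported above and below dies in the same place: Triton rejects the kernel during ast_to_ttir because it requests the fp8e4nv (torch.float8_e4m3fn) dtype, and the GPU on this g5.4xlarge runner (an NVIDIA A10G, compute capability 8.6) only offers fp8e4b15 and fp8e5. A minimal sketch of a guard that would skip the test on such GPUs instead of failing every example, assuming the usual torch.cuda.get_device_capability() check and assuming fp8e4nv needs compute capability 8.9 or newer (the helper and class names are hypothetical, not part of this test suite):

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper, assuming Triton's fp8e4nv (torch.float8_e4m3fn)
    # conversions need compute capability >= 8.9 (Ada/Hopper); the A10G
    # on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Usage on a test like the one failing here:
@unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class SiluMulQuantTest(unittest.TestCase):
    ...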
Hypothesis goes on to try the remaining examples, and every one fails with this same CompilationError. In this stretch of the log the failure surfaces in _fbgemm_silu_mul_quant (via fn at moe/activation_test.py:115-117) whenever compiled=False, and in _kernel_quantize_fp8_row (via ref_fn at moe/activation_test.py:124-126) whenever compiled=True:

Trying example: test_silu_mul_quant(T=1,    D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None,   contiguous=False, compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,    D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,  D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
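The error text itself names the two fp8 formats Triton does accept on this architecture: fp8e4b15 and fp8e5, the latter being torch.float8_e5m2. Purely as an illustration of a capability-aware fallback, not something the FBGEMM kernels above do (they are written against fp8e4nv), a caller could select the quantization dtype from the device, again assuming the 8.9 threshold:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Illustrative fallback only: torch.float8_e4m3fn maps to Triton's
    # fp8e4nv (assumed to need >= sm_89), while torch.float8_e5m2 maps to
    # fp8e5, which the error message lists as supported on this GPU.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2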
Each of those examples ends with the identical traceback through /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273 (compile -> src.make_ir -> ast_to_ttir), reported as:

E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4390287Z 2025-05-07T20:32:29.4390764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4390769Z 2025-05-07T20:32:29.4390951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4391280Z self=, 2025-05-07T20:32:29.4391389Z T=4096, 2025-05-07T20:32:29.4391496Z D=5120, 2025-05-07T20:32:29.4391642Z scale_ub=None, 2025-05-07T20:32:29.4391785Z contiguous=True, 2025-05-07T20:32:29.4391946Z compiled=True, 2025-05-07T20:32:29.4392123Z ) 2025-05-07T20:32:29.4392381Z self = 2025-05-07T20:32:29.4392585Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.4392631Z 2025-05-07T20:32:29.4392774Z @given( 2025-05-07T20:32:29.4392916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4393211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4393360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4393507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4393687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4393797Z ) 2025-05-07T20:32:29.4394107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4394355Z def test_silu_mul_quant( 2025-05-07T20:32:29.4394464Z self, 2025-05-07T20:32:29.4394571Z T: int, 2025-05-07T20:32:29.4394711Z D: int, 2025-05-07T20:32:29.4394840Z scale_ub: Optional[float], 2025-05-07T20:32:29.4395068Z contiguous: bool, 2025-05-07T20:32:29.4395201Z compiled: bool, 2025-05-07T20:32:29.4395311Z ) -> None: 2025-05-07T20:32:29.4395479Z torch.manual_seed(2025) 2025-05-07T20:32:29.4395586Z 2025-05-07T20:32:29.4395814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4395991Z 2025-05-07T20:32:29.4396129Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4396284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4396440Z x = x_sign * x_clamp 2025-05-07T20:32:29.4396551Z x0 = x[:, :D] 2025-05-07T20:32:29.4396684Z x1 = x[:, D:] 2025-05-07T20:32:29.4396857Z 2025-05-07T20:32:29.4397031Z if contiguous: 2025-05-07T20:32:29.4397187Z x0 = x0.contiguous() 2025-05-07T20:32:29.4397309Z x1 = x1.contiguous() 2025-05-07T20:32:29.4397434Z 2025-05-07T20:32:29.4397576Z if scale_ub is not None: 2025-05-07T20:32:29.4397764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4397952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4398096Z ) 2025-05-07T20:32:29.4398202Z else: 2025-05-07T20:32:29.4398353Z scale_ub_tensor = None 2025-05-07T20:32:29.4398476Z 2025-05-07T20:32:29.4398692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4398860Z op = silu_mul_quant 2025-05-07T20:32:29.4398975Z if compiled: 2025-05-07T20:32:29.4399128Z op = torch.compile(op) 2025-05-07T20:32:29.4399299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4399386Z 2025-05-07T20:32:29.4399559Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4399758Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4399865Z 2025-05-07T20:32:29.4400055Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4400330Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4400445Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4400774Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4401058Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4401163Z 2025-05-07T20:32:29.4401330Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.4401335Z 2025-05-07T20:32:29.4401464Z moe/activation_test.py:126: 2025-05-07T20:32:29.4401609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4401843Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4402030Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4402668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4402843Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4403247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4403577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4404068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4404397Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4404813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4405015Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4405415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4405594Z fn() 2025-05-07T20:32:29.4406052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4406199Z self.fn.run( 2025-05-07T20:32:29.4406579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4406785Z kernel = self.compile( 2025-05-07T20:32:29.4407197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4407473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4407680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4407684Z 2025-05-07T20:32:29.4407923Z self = 2025-05-07T20:32:29.4408788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4409339Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f96ae700>} 2025-05-07T20:32:29.4410229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4410472Z context = 2025-05-07T20:32:29.4410477Z 2025-05-07T20:32:29.4410675Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4411011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4411148Z module_map=module_map) 2025-05-07T20:32:29.4411366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4411564Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4411685Z E ^ 2025-05-07T20:32:29.4412112Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4412120Z 2025-05-07T20:32:29.4412655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4412660Z 2025-05-07T20:32:29.4412819Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4413099Z self=, 2025-05-07T20:32:29.4413260Z T=16384, 2025-05-07T20:32:29.4413654Z D=5120, 2025-05-07T20:32:29.4413804Z scale_ub=None, 2025-05-07T20:32:29.4413987Z contiguous=True, 2025-05-07T20:32:29.4414134Z compiled=True, 2025-05-07T20:32:29.4414223Z ) 2025-05-07T20:32:29.4414634Z self = 2025-05-07T20:32:29.4414890Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.4414895Z 2025-05-07T20:32:29.4415027Z @given( 2025-05-07T20:32:29.4415213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4415409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4415547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4415814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4415981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4416085Z ) 2025-05-07T20:32:29.4416402Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4416526Z def test_silu_mul_quant( 2025-05-07T20:32:29.4416618Z self, 2025-05-07T20:32:29.4416821Z T: int, 2025-05-07T20:32:29.4416950Z D: int, 2025-05-07T20:32:29.4417111Z scale_ub: Optional[float], 2025-05-07T20:32:29.4417233Z contiguous: bool, 2025-05-07T20:32:29.4417350Z compiled: bool, 2025-05-07T20:32:29.4417524Z ) -> None: 2025-05-07T20:32:29.4417684Z torch.manual_seed(2025) 2025-05-07T20:32:29.4417786Z 2025-05-07T20:32:29.4418072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4418179Z 2025-05-07T20:32:29.4418306Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4418529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4418684Z x = x_sign * x_clamp 2025-05-07T20:32:29.4418830Z x0 = x[:, :D] 2025-05-07T20:32:29.4418939Z x1 = x[:, D:] 2025-05-07T20:32:29.4419041Z 2025-05-07T20:32:29.4419173Z if contiguous: 2025-05-07T20:32:29.4419340Z x0 = x0.contiguous() 2025-05-07T20:32:29.4419496Z x1 = x1.contiguous() 2025-05-07T20:32:29.4419630Z 2025-05-07T20:32:29.4419752Z if scale_ub is not None: 2025-05-07T20:32:29.4419891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4420079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4420251Z ) 2025-05-07T20:32:29.4420374Z else: 2025-05-07T20:32:29.4420533Z scale_ub_tensor = None 2025-05-07T20:32:29.4420638Z 2025-05-07T20:32:29.4420832Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4420943Z op = silu_mul_quant 2025-05-07T20:32:29.4421126Z if compiled: 2025-05-07T20:32:29.4421343Z op = torch.compile(op) 2025-05-07T20:32:29.4421480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4421582Z 2025-05-07T20:32:29.4421736Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4421895Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4422041Z 2025-05-07T20:32:29.4422256Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4422390Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4422556Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4422734Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4422893Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4423088Z 2025-05-07T20:32:29.4423223Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.4423228Z 2025-05-07T20:32:29.4423488Z moe/activation_test.py:126: 2025-05-07T20:32:29.4423708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4423845Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4424086Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4424705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4424839Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4425291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4425618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4426049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4427067Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4427501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4427756Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4428138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4428244Z fn() 2025-05-07T20:32:29.4428707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4428873Z self.fn.run( 2025-05-07T20:32:29.4429322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4429447Z kernel = self.compile( 2025-05-07T20:32:29.4429869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4430119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4430263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4430268Z 2025-05-07T20:32:29.4430607Z self = 2025-05-07T20:32:29.4431459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4432010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f941eb60>} 2025-05-07T20:32:29.4432847Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4433074Z context = 2025-05-07T20:32:29.4433079Z 2025-05-07T20:32:29.4433347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4433687Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4433860Z module_map=module_map) 2025-05-07T20:32:29.4434054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4434187Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4434315Z E ^ 2025-05-07T20:32:29.4434797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4434802Z 2025-05-07T20:32:29.4435296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4435338Z 2025-05-07T20:32:29.4435554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4435816Z self=, 2025-05-07T20:32:29.4435956Z T=1, 2025-05-07T20:32:29.4436048Z D=5120, 2025-05-07T20:32:29.4436239Z scale_ub=1200.0, 2025-05-07T20:32:29.4436401Z contiguous=True, 2025-05-07T20:32:29.4436516Z compiled=True, 2025-05-07T20:32:29.4436619Z ) 2025-05-07T20:32:29.4436906Z self = 2025-05-07T20:32:29.4437111Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4437159Z 2025-05-07T20:32:29.4437369Z @given( 2025-05-07T20:32:29.4437521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4437651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4437830Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4438042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4438186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4443116Z ) 2025-05-07T20:32:29.4443394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4443491Z def test_silu_mul_quant( 2025-05-07T20:32:29.4443575Z self, 2025-05-07T20:32:29.4443652Z T: int, 2025-05-07T20:32:29.4443734Z D: int, 2025-05-07T20:32:29.4443838Z scale_ub: Optional[float], 2025-05-07T20:32:29.4443930Z contiguous: bool, 2025-05-07T20:32:29.4444027Z compiled: bool, 2025-05-07T20:32:29.4444113Z ) -> None: 2025-05-07T20:32:29.4444208Z torch.manual_seed(2025) 2025-05-07T20:32:29.4444292Z 2025-05-07T20:32:29.4444472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4444546Z 2025-05-07T20:32:29.4444648Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4444776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4444874Z x = x_sign * x_clamp 2025-05-07T20:32:29.4444961Z x0 = x[:, :D] 2025-05-07T20:32:29.4445041Z x1 = x[:, D:] 2025-05-07T20:32:29.4445113Z 2025-05-07T20:32:29.4445203Z if contiguous: 2025-05-07T20:32:29.4445295Z x0 = x0.contiguous() 2025-05-07T20:32:29.4445390Z x1 = x1.contiguous() 2025-05-07T20:32:29.4445464Z 2025-05-07T20:32:29.4445555Z if scale_ub is not None: 2025-05-07T20:32:29.4445668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4445805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4445884Z ) 2025-05-07T20:32:29.4445966Z else: 2025-05-07T20:32:29.4446061Z scale_ub_tensor = None 2025-05-07T20:32:29.4446134Z 2025-05-07T20:32:29.4446276Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4446368Z op = silu_mul_quant 2025-05-07T20:32:29.4446456Z if compiled: 2025-05-07T20:32:29.4446572Z op = torch.compile(op) 2025-05-07T20:32:29.4446680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4446759Z 2025-05-07T20:32:29.4446851Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4446856Z 2025-05-07T20:32:29.4446955Z moe/activation_test.py:117: 2025-05-07T20:32:29.4447094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4447196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4447296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4447686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4447783Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4448299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4448398Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4448872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4449107Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4449459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4449555Z kernel = self.compile( 2025-05-07T20:32:29.4449957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4450139Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4450315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4450320Z 2025-05-07T20:32:29.4450531Z self = 2025-05-07T20:32:29.4451339Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4452702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a47060>} 2025-05-07T20:32:29.4453477Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4453686Z context = 2025-05-07T20:32:29.4453693Z 2025-05-07T20:32:29.4453864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4454143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4454259Z module_map=module_map) 2025-05-07T20:32:29.4454433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4454541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4454621Z E ^ 2025-05-07T20:32:29.4454988Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4454993Z 2025-05-07T20:32:29.4455430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4455434Z 2025-05-07T20:32:29.4455539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4455777Z self=, 2025-05-07T20:32:29.4455855Z T=1, 2025-05-07T20:32:29.4455932Z D=5120, 2025-05-07T20:32:29.4456019Z scale_ub=None, 2025-05-07T20:32:29.4456109Z contiguous=False, 2025-05-07T20:32:29.4456196Z compiled=True, 2025-05-07T20:32:29.4456281Z ) 2025-05-07T20:32:29.4456512Z self = 2025-05-07T20:32:29.4456682Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4456690Z 2025-05-07T20:32:29.4456768Z @given( 2025-05-07T20:32:29.4456893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4457003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4457120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4457239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4457363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4457441Z ) 2025-05-07T20:32:29.4457700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4457801Z def test_silu_mul_quant( 2025-05-07T20:32:29.4457880Z self, 2025-05-07T20:32:29.4457963Z T: int, 2025-05-07T20:32:29.4458041Z D: int, 2025-05-07T20:32:29.4458144Z scale_ub: Optional[float], 2025-05-07T20:32:29.4458327Z contiguous: bool, 2025-05-07T20:32:29.4458420Z compiled: bool, 2025-05-07T20:32:29.4458499Z ) -> None: 2025-05-07T20:32:29.4458603Z torch.manual_seed(2025) 2025-05-07T20:32:29.4458677Z 2025-05-07T20:32:29.4458852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4458934Z 2025-05-07T20:32:29.4459030Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4459157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4459253Z x = x_sign * x_clamp 2025-05-07T20:32:29.4459332Z x0 = x[:, :D] 2025-05-07T20:32:29.4459457Z x1 = x[:, D:] 2025-05-07T20:32:29.4459535Z 2025-05-07T20:32:29.4459620Z if contiguous: 2025-05-07T20:32:29.4459716Z x0 = x0.contiguous() 2025-05-07T20:32:29.4459807Z x1 = x1.contiguous() 2025-05-07T20:32:29.4459881Z 2025-05-07T20:32:29.4460023Z if scale_ub is not None: 2025-05-07T20:32:29.4460136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4460275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4460356Z ) 2025-05-07T20:32:29.4460434Z else: 2025-05-07T20:32:29.4460531Z scale_ub_tensor = None 2025-05-07T20:32:29.4460611Z 2025-05-07T20:32:29.4460744Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4460834Z op = silu_mul_quant 2025-05-07T20:32:29.4460925Z if compiled: 2025-05-07T20:32:29.4461025Z op = torch.compile(op) 2025-05-07T20:32:29.4461138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4461213Z 2025-05-07T20:32:29.4461305Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4461432Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4461504Z 2025-05-07T20:32:29.4461641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4461751Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4461857Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4461981Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4462131Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4462205Z 2025-05-07T20:32:29.4462305Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.4462315Z 2025-05-07T20:32:29.4462413Z moe/activation_test.py:126: 2025-05-07T20:32:29.4462543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4462655Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4462795Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4463368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4463476Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4463855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4464090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4464470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4464739Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4465131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4465306Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4465660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4465743Z fn() 2025-05-07T20:32:29.4466157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4466402Z self.fn.run( 2025-05-07T20:32:29.4466762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4466860Z kernel = self.compile( 2025-05-07T20:32:29.4467259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4467438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4467567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4467617Z 2025-05-07T20:32:29.4467825Z self = 2025-05-07T20:32:29.4468627Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4469196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f95a6a20>} 2025-05-07T20:32:29.4469964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4470164Z context = 2025-05-07T20:32:29.4470168Z 2025-05-07T20:32:29.4470338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4470613Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4470728Z module_map=module_map) 2025-05-07T20:32:29.4470891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4470998Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4471076Z E ^ 2025-05-07T20:32:29.4471444Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4471449Z 2025-05-07T20:32:29.4471880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4471885Z 2025-05-07T20:32:29.4471989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4472218Z self=, 2025-05-07T20:32:29.4472300Z T=1, 2025-05-07T20:32:29.4472379Z D=5120, 2025-05-07T20:32:29.4472469Z scale_ub=None, 2025-05-07T20:32:29.4472553Z contiguous=True, 2025-05-07T20:32:29.4472637Z compiled=False, 2025-05-07T20:32:29.4472715Z ) 2025-05-07T20:32:29.4472937Z self = 2025-05-07T20:32:29.4473104Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:29.4473112Z 2025-05-07T20:32:29.4473196Z @given( 2025-05-07T20:32:29.4473319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4473423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4473542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4473661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4473779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4473853Z ) 2025-05-07T20:32:29.4474106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4474208Z def test_silu_mul_quant( 2025-05-07T20:32:29.4474285Z self, 2025-05-07T20:32:29.4474362Z T: int, 2025-05-07T20:32:29.4474448Z D: int, 2025-05-07T20:32:29.4474547Z scale_ub: Optional[float], 2025-05-07T20:32:29.4474637Z contiguous: bool, 2025-05-07T20:32:29.4474727Z compiled: bool, 2025-05-07T20:32:29.4474807Z ) -> None: 2025-05-07T20:32:29.4474980Z torch.manual_seed(2025) 2025-05-07T20:32:29.4475059Z 2025-05-07T20:32:29.4475235Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4475318Z 2025-05-07T20:32:29.4475414Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4475544Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4475635Z x = x_sign * x_clamp 2025-05-07T20:32:29.4475719Z x0 = x[:, :D] 2025-05-07T20:32:29.4475802Z x1 = x[:, D:] 2025-05-07T20:32:29.4475878Z 2025-05-07T20:32:29.4475961Z if contiguous: 2025-05-07T20:32:29.4476094Z x0 = x0.contiguous() 2025-05-07T20:32:29.4476193Z x1 = x1.contiguous() 2025-05-07T20:32:29.4476262Z 2025-05-07T20:32:29.4476355Z if scale_ub is not None: 2025-05-07T20:32:29.4476467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4476603Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4476724Z ) 2025-05-07T20:32:29.4476808Z else: 2025-05-07T20:32:29.4476904Z scale_ub_tensor = None 2025-05-07T20:32:29.4476978Z 2025-05-07T20:32:29.4477107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4477196Z op = silu_mul_quant 2025-05-07T20:32:29.4477281Z if compiled: 2025-05-07T20:32:29.4477382Z op = torch.compile(op) 2025-05-07T20:32:29.4477489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4477568Z 2025-05-07T20:32:29.4477659Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4477665Z 2025-05-07T20:32:29.4477763Z moe/activation_test.py:117: 2025-05-07T20:32:29.4477894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4477994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4478097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4478617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4478716Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4479091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4479322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4479673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4479773Z kernel = self.compile( 2025-05-07T20:32:29.4480213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4480400Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4480529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4480537Z 2025-05-07T20:32:29.4480745Z self = 2025-05-07T20:32:29.4481554Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4482075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328f2b8860>} 2025-05-07T20:32:29.4482851Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4483047Z context = 2025-05-07T20:32:29.4483051Z 2025-05-07T20:32:29.4483223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4483575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4483685Z module_map=module_map) 2025-05-07T20:32:29.4483854Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4483953Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4484028Z E ^ 2025-05-07T20:32:29.4484399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4484404Z 2025-05-07T20:32:29.4484831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4484874Z 2025-05-07T20:32:29.4484987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4485216Z self=, 2025-05-07T20:32:29.4485296Z T=128, 2025-05-07T20:32:29.4485416Z D=5120, 2025-05-07T20:32:29.4485498Z scale_ub=None, 2025-05-07T20:32:29.4485589Z contiguous=False, 2025-05-07T20:32:29.4485676Z compiled=True, 2025-05-07T20:32:29.4485750Z ) 2025-05-07T20:32:29.4485975Z self = 2025-05-07T20:32:29.4486156Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4486161Z 2025-05-07T20:32:29.4486239Z @given( 2025-05-07T20:32:29.4486365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4486466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4486581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4486706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4486818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4486891Z ) 2025-05-07T20:32:29.4487145Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4487240Z def test_silu_mul_quant( 2025-05-07T20:32:29.4487322Z self, 2025-05-07T20:32:29.4487402Z T: int, 2025-05-07T20:32:29.4487477Z D: int, 2025-05-07T20:32:29.4487577Z scale_ub: Optional[float], 2025-05-07T20:32:29.4487668Z contiguous: bool, 2025-05-07T20:32:29.4487753Z compiled: bool, 2025-05-07T20:32:29.4487837Z ) -> None: 2025-05-07T20:32:29.4487930Z torch.manual_seed(2025) 2025-05-07T20:32:29.4488003Z 2025-05-07T20:32:29.4488179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4488252Z 2025-05-07T20:32:29.4488345Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4488478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4488566Z x = x_sign * x_clamp 2025-05-07T20:32:29.4488645Z x0 = x[:, :D] 2025-05-07T20:32:29.4488726Z x1 = x[:, D:] 2025-05-07T20:32:29.4488799Z 2025-05-07T20:32:29.4488886Z if contiguous: 2025-05-07T20:32:29.4488982Z x0 = x0.contiguous() 2025-05-07T20:32:29.4489075Z x1 = x1.contiguous() 2025-05-07T20:32:29.4489153Z 2025-05-07T20:32:29.4489244Z if scale_ub is not None: 2025-05-07T20:32:29.4489351Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4489488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4489565Z ) 2025-05-07T20:32:29.4489639Z else: 2025-05-07T20:32:29.4489739Z scale_ub_tensor = None 2025-05-07T20:32:29.4489812Z 2025-05-07T20:32:29.4489943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4490038Z op = silu_mul_quant 2025-05-07T20:32:29.4490126Z if compiled: 2025-05-07T20:32:29.4490228Z op = torch.compile(op) 2025-05-07T20:32:29.4490335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4490408Z 2025-05-07T20:32:29.4490502Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4490509Z 2025-05-07T20:32:29.4490606Z moe/activation_test.py:117: 2025-05-07T20:32:29.4490815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4490925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4491024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4491406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4491499Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4492010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4492155Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4492523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4492755Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4493155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4493250Z kernel = self.compile( 2025-05-07T20:32:29.4493648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4493833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4493961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4493965Z 2025-05-07T20:32:29.4494179Z self = 2025-05-07T20:32:29.4494980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4495505Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9b8df80>} 2025-05-07T20:32:29.4496285Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4496483Z context = 2025-05-07T20:32:29.4496488Z 2025-05-07T20:32:29.4496657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4496926Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4497040Z module_map=module_map) 2025-05-07T20:32:29.4497204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4497302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4497378Z E ^ 2025-05-07T20:32:29.4497741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4497753Z 2025-05-07T20:32:29.4498186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4498191Z 2025-05-07T20:32:29.4498296Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4498524Z self=, 2025-05-07T20:32:29.4498605Z T=128, 2025-05-07T20:32:29.4498679Z D=7168, 2025-05-07T20:32:29.4498761Z scale_ub=1200.0, 2025-05-07T20:32:29.4498851Z contiguous=False, 2025-05-07T20:32:29.4498937Z compiled=False, 2025-05-07T20:32:29.4499011Z ) 2025-05-07T20:32:29.4499234Z self = 2025-05-07T20:32:29.4499413Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4499417Z 2025-05-07T20:32:29.4499498Z @given( 2025-05-07T20:32:29.4499623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4499824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4499947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4500065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4500178Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4500254Z ) 2025-05-07T20:32:29.4500507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4500605Z def test_silu_mul_quant( 2025-05-07T20:32:29.4500681Z self, 2025-05-07T20:32:29.4500757Z T: int, 2025-05-07T20:32:29.4500880Z D: int, 2025-05-07T20:32:29.4500979Z scale_ub: Optional[float], 2025-05-07T20:32:29.4501067Z contiguous: bool, 2025-05-07T20:32:29.4501157Z compiled: bool, 2025-05-07T20:32:29.4501233Z ) -> None: 2025-05-07T20:32:29.4501327Z torch.manual_seed(2025) 2025-05-07T20:32:29.4501441Z 2025-05-07T20:32:29.4501621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4501694Z 2025-05-07T20:32:29.4501791Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4501917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4502009Z x = x_sign * x_clamp 2025-05-07T20:32:29.4502088Z x0 = x[:, :D] 2025-05-07T20:32:29.4502167Z x1 = x[:, D:] 2025-05-07T20:32:29.4502249Z 2025-05-07T20:32:29.4502334Z if contiguous: 2025-05-07T20:32:29.4502431Z x0 = x0.contiguous() 2025-05-07T20:32:29.4502519Z x1 = x1.contiguous() 2025-05-07T20:32:29.4502594Z 2025-05-07T20:32:29.4502691Z if scale_ub is not None: 2025-05-07T20:32:29.4502796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4502931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4503011Z ) 2025-05-07T20:32:29.4503088Z else: 2025-05-07T20:32:29.4503188Z scale_ub_tensor = None 2025-05-07T20:32:29.4503261Z 2025-05-07T20:32:29.4503398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4503493Z op = silu_mul_quant 2025-05-07T20:32:29.4503577Z if compiled: 2025-05-07T20:32:29.4503676Z op = torch.compile(op) 2025-05-07T20:32:29.4503785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4503857Z 2025-05-07T20:32:29.4503947Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4503951Z 2025-05-07T20:32:29.4504054Z moe/activation_test.py:117: 2025-05-07T20:32:29.4504181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4504283Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4504384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4504896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4505000Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4505374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4505605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4505964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4506059Z kernel = self.compile( 2025-05-07T20:32:29.4506458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4506638Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4506766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4506771Z 2025-05-07T20:32:29.4506983Z self = 2025-05-07T20:32:29.4507866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4508394Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328f033060>} 2025-05-07T20:32:29.4509165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4509397Z context = 2025-05-07T20:32:29.4509402Z 2025-05-07T20:32:29.4509573Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4509844Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4510009Z module_map=module_map) 2025-05-07T20:32:29.4510176Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4510275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4510355Z E ^ 2025-05-07T20:32:29.4510716Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4510721Z 2025-05-07T20:32:29.4511148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4511153Z 2025-05-07T20:32:29.4511257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4511487Z self=, 2025-05-07T20:32:29.4511568Z T=128, 2025-05-07T20:32:29.4511643Z D=5120, 2025-05-07T20:32:29.4511725Z scale_ub=None, 2025-05-07T20:32:29.4511815Z contiguous=False, 2025-05-07T20:32:29.4511901Z compiled=False, 2025-05-07T20:32:29.4511973Z ) 2025-05-07T20:32:29.4512203Z self = 2025-05-07T20:32:29.4512376Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.4512381Z 2025-05-07T20:32:29.4512460Z @given( 2025-05-07T20:32:29.4512578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4512677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4512798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4512915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4513032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4513113Z ) 2025-05-07T20:32:29.4513612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4513757Z def test_silu_mul_quant( 2025-05-07T20:32:29.4513851Z self, 2025-05-07T20:32:29.4513930Z T: int, 2025-05-07T20:32:29.4514013Z D: int, 2025-05-07T20:32:29.4514117Z scale_ub: Optional[float], 2025-05-07T20:32:29.4514206Z contiguous: bool, 2025-05-07T20:32:29.4514296Z compiled: bool, 2025-05-07T20:32:29.4514374Z ) -> None: 2025-05-07T20:32:29.4514470Z torch.manual_seed(2025) 2025-05-07T20:32:29.4514544Z 2025-05-07T20:32:29.4514717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4514789Z 2025-05-07T20:32:29.4514883Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4515008Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4515096Z x = x_sign * x_clamp 2025-05-07T20:32:29.4515181Z x0 = x[:, :D] 2025-05-07T20:32:29.4515263Z x1 = x[:, D:] 2025-05-07T20:32:29.4515334Z 2025-05-07T20:32:29.4515421Z if contiguous: 2025-05-07T20:32:29.4515513Z x0 = x0.contiguous() 2025-05-07T20:32:29.4515607Z x1 = x1.contiguous() 2025-05-07T20:32:29.4515681Z 2025-05-07T20:32:29.4515770Z if scale_ub is not None: 2025-05-07T20:32:29.4516028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4516168Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4516243Z ) 2025-05-07T20:32:29.4516320Z else: 2025-05-07T20:32:29.4516414Z scale_ub_tensor = None 2025-05-07T20:32:29.4516486Z 2025-05-07T20:32:29.4516621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4516711Z op = silu_mul_quant 2025-05-07T20:32:29.4516795Z if compiled: 2025-05-07T20:32:29.4516898Z op = torch.compile(op) 2025-05-07T20:32:29.4517062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4517137Z 2025-05-07T20:32:29.4517226Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4517230Z 2025-05-07T20:32:29.4517326Z moe/activation_test.py:117: 2025-05-07T20:32:29.4517459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4517623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4517723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4518243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4518340Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4518714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4518945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4519297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4519397Z kernel = self.compile( 2025-05-07T20:32:29.4519793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4519973Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4520168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4520173Z 2025-05-07T20:32:29.4520381Z self = 2025-05-07T20:32:29.4521190Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4521710Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e1872e0>} 2025-05-07T20:32:29.4522491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4522693Z context = 2025-05-07T20:32:29.4522698Z 2025-05-07T20:32:29.4522864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4523138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4523244Z module_map=module_map) 2025-05-07T20:32:29.4523409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4523509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4523585Z E ^ 2025-05-07T20:32:29.4523980Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4523990Z 2025-05-07T20:32:29.4524440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4524445Z 2025-05-07T20:32:29.4524550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4524864Z self=, 2025-05-07T20:32:29.4524942Z T=128, 2025-05-07T20:32:29.4525022Z D=5120, 2025-05-07T20:32:29.4525105Z scale_ub=1200.0, 2025-05-07T20:32:29.4525193Z contiguous=True, 2025-05-07T20:32:29.4525280Z compiled=False, 2025-05-07T20:32:29.4525350Z ) 2025-05-07T20:32:29.4525572Z self = 2025-05-07T20:32:29.4525751Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.4525755Z 2025-05-07T20:32:29.4525832Z @given( 2025-05-07T20:32:29.4525990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4526093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4526208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4526327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4526505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4526577Z ) 2025-05-07T20:32:29.4526838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4526931Z def test_silu_mul_quant( 2025-05-07T20:32:29.4527006Z self, 2025-05-07T20:32:29.4527087Z T: int, 2025-05-07T20:32:29.4527163Z D: int, 2025-05-07T20:32:29.4527260Z scale_ub: Optional[float], 2025-05-07T20:32:29.4527356Z contiguous: bool, 2025-05-07T20:32:29.4527440Z compiled: bool, 2025-05-07T20:32:29.4527516Z ) -> None: 2025-05-07T20:32:29.4527617Z torch.manual_seed(2025) 2025-05-07T20:32:29.4527691Z 2025-05-07T20:32:29.4527869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4527942Z 2025-05-07T20:32:29.4528033Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4528162Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4528248Z x = x_sign * x_clamp 2025-05-07T20:32:29.4528332Z x0 = x[:, :D] 2025-05-07T20:32:29.4528420Z x1 = x[:, D:] 2025-05-07T20:32:29.4528492Z 2025-05-07T20:32:29.4528575Z if contiguous: 2025-05-07T20:32:29.4528669Z x0 = x0.contiguous() 2025-05-07T20:32:29.4528757Z x1 = x1.contiguous() 2025-05-07T20:32:29.4528827Z 2025-05-07T20:32:29.4528919Z if scale_ub is not None: 2025-05-07T20:32:29.4529023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4529161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4529236Z ) 2025-05-07T20:32:29.4529310Z else: 2025-05-07T20:32:29.4529409Z scale_ub_tensor = None 2025-05-07T20:32:29.4529480Z 2025-05-07T20:32:29.4529611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4529701Z op = silu_mul_quant 2025-05-07T20:32:29.4529785Z if compiled: 2025-05-07T20:32:29.4529883Z op = torch.compile(op) 2025-05-07T20:32:29.4529994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4530070Z 2025-05-07T20:32:29.4530161Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4530165Z 2025-05-07T20:32:29.4530264Z moe/activation_test.py:117: 2025-05-07T20:32:29.4530392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4530498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4530596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4531107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4531212Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4531582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4531810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4532247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4532343Z kernel = self.compile( 2025-05-07T20:32:29.4532742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4532921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4533049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4533053Z 2025-05-07T20:32:29.4533266Z self = 2025-05-07T20:32:29.4534103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4534630Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e186ac0>} 2025-05-07T20:32:29.4535435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4535629Z context = 2025-05-07T20:32:29.4535638Z 2025-05-07T20:32:29.4535804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4536073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4536188Z module_map=module_map) 2025-05-07T20:32:29.4536349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4536445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4536528Z E ^ 2025-05-07T20:32:29.4536900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4536907Z 2025-05-07T20:32:29.4537336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4537340Z 2025-05-07T20:32:29.4537444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4537671Z self=, 2025-05-07T20:32:29.4537755Z T=1, 2025-05-07T20:32:29.4537829Z D=7168, 2025-05-07T20:32:29.4537910Z scale_ub=1200.0, 2025-05-07T20:32:29.4538002Z contiguous=True, 2025-05-07T20:32:29.4538086Z compiled=True, 2025-05-07T20:32:29.4538158Z ) 2025-05-07T20:32:29.4538385Z self = 2025-05-07T20:32:29.4538553Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4538557Z 2025-05-07T20:32:29.4538635Z @given( 2025-05-07T20:32:29.4538757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4538860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4538978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4539095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4539209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4539288Z ) 2025-05-07T20:32:29.4539539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4539637Z def test_silu_mul_quant( 2025-05-07T20:32:29.4539711Z self, 2025-05-07T20:32:29.4539785Z T: int, 2025-05-07T20:32:29.4539868Z D: int, 2025-05-07T20:32:29.4539965Z scale_ub: Optional[float], 2025-05-07T20:32:29.4540053Z contiguous: bool, 2025-05-07T20:32:29.4540142Z compiled: bool, 2025-05-07T20:32:29.4540218Z ) -> None: 2025-05-07T20:32:29.4540313Z torch.manual_seed(2025) 2025-05-07T20:32:29.4540390Z 2025-05-07T20:32:29.4540640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4540713Z 2025-05-07T20:32:29.4540806Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4540931Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4541017Z x = x_sign * x_clamp 2025-05-07T20:32:29.4541098Z x0 = x[:, :D] 2025-05-07T20:32:29.4541175Z x1 = x[:, D:] 2025-05-07T20:32:29.4541248Z 2025-05-07T20:32:29.4541331Z if contiguous: 2025-05-07T20:32:29.4541423Z x0 = x0.contiguous() 2025-05-07T20:32:29.4541515Z x1 = x1.contiguous() 2025-05-07T20:32:29.4541625Z 2025-05-07T20:32:29.4541716Z if scale_ub is not None: 2025-05-07T20:32:29.4541824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4541959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4542032Z ) 2025-05-07T20:32:29.4542111Z else: 2025-05-07T20:32:29.4542247Z scale_ub_tensor = None 2025-05-07T20:32:29.4542318Z 2025-05-07T20:32:29.4542459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4542549Z op = silu_mul_quant 2025-05-07T20:32:29.4542638Z if compiled: 2025-05-07T20:32:29.4542737Z op = torch.compile(op) 2025-05-07T20:32:29.4542843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4542919Z 2025-05-07T20:32:29.4543009Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4543013Z 2025-05-07T20:32:29.4543108Z moe/activation_test.py:117: 2025-05-07T20:32:29.4543236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4543340Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4543439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4543819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4543915Z return fn(*args, **kwargs) 
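[annotation] The recurring CompilationError above is an architecture mismatch rather than a numerical bug: Triton's fp8e4nv element type (PyTorch's torch.float8_e4m3fn) only compiles on NVIDIA GPUs at compute capability 8.9 (Ada) or newer, and on this runner's GPU the Triton backend exposes only the fp8e4b15 and fp8e5 formats, exactly as the ValueError reports. A minimal capability probe, as a sketch (the function name is illustrative, not part of FBGEMM's API):

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv kernels compile only on Ada (SM 8.9) and Hopper (SM 9.0+).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)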
2025-05-07T20:32:29.4544433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4544529Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4544895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4545126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4545476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4545573Z kernel = self.compile( 2025-05-07T20:32:29.4545971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4546148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4546280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4546287Z 2025-05-07T20:32:29.4546498Z self = 2025-05-07T20:32:29.4547301Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4547823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328eaac680>} 2025-05-07T20:32:29.4548597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4548797Z context = 2025-05-07T20:32:29.4548804Z 2025-05-07T20:32:29.4548969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4549326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4549437Z module_map=module_map) 2025-05-07T20:32:29.4549599Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4549698Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4549775Z E ^ 2025-05-07T20:32:29.4550137Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4550142Z 2025-05-07T20:32:29.4550608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4550612Z 2025-05-07T20:32:29.4550716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4550946Z self=, 2025-05-07T20:32:29.4551060Z T=1, 2025-05-07T20:32:29.4551137Z D=7168, 2025-05-07T20:32:29.4551228Z scale_ub=1200.0, 2025-05-07T20:32:29.4551314Z contiguous=False, 2025-05-07T20:32:29.4551394Z compiled=True, 2025-05-07T20:32:29.4551470Z ) 2025-05-07T20:32:29.4551692Z self = 2025-05-07T20:32:29.4551862Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4551871Z 2025-05-07T20:32:29.4551946Z @given( 2025-05-07T20:32:29.4552064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4552165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4552281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4552396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4552512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4552583Z ) 2025-05-07T20:32:29.4552833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4552938Z def test_silu_mul_quant( 2025-05-07T20:32:29.4553013Z self, 2025-05-07T20:32:29.4553093Z T: int, 2025-05-07T20:32:29.4553167Z D: int, 2025-05-07T20:32:29.4553266Z scale_ub: Optional[float], 2025-05-07T20:32:29.4553356Z contiguous: bool, 2025-05-07T20:32:29.4553443Z compiled: bool, 2025-05-07T20:32:29.4553519Z ) -> None: 2025-05-07T20:32:29.4553617Z torch.manual_seed(2025) 2025-05-07T20:32:29.4553689Z 2025-05-07T20:32:29.4553861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4553944Z 2025-05-07T20:32:29.4554055Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4554196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4554294Z x = x_sign * x_clamp 2025-05-07T20:32:29.4554374Z x0 = x[:, :D] 2025-05-07T20:32:29.4554453Z x1 = x[:, D:] 2025-05-07T20:32:29.4554531Z 2025-05-07T20:32:29.4554614Z if contiguous: 2025-05-07T20:32:29.4554711Z x0 = x0.contiguous() 2025-05-07T20:32:29.4554799Z x1 = x1.contiguous() 2025-05-07T20:32:29.4554870Z 2025-05-07T20:32:29.4554963Z if scale_ub is not None: 2025-05-07T20:32:29.4555069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4555204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4555281Z ) 2025-05-07T20:32:29.4555357Z else: 2025-05-07T20:32:29.4555450Z scale_ub_tensor = None 2025-05-07T20:32:29.4555525Z 2025-05-07T20:32:29.4555654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4555746Z op = silu_mul_quant 2025-05-07T20:32:29.4555832Z if compiled: 2025-05-07T20:32:29.4555930Z op = torch.compile(op) 2025-05-07T20:32:29.4556036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4556106Z 2025-05-07T20:32:29.4556197Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4556202Z 2025-05-07T20:32:29.4556383Z moe/activation_test.py:117: 2025-05-07T20:32:29.4556515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4556616Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4556720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4557093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4557186Z return fn(*args, **kwargs) 
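[annotation] For reference, the op under test fuses a SiLU-gated product with row-wise fp8 quantization; the test's own ref_fn pins the semantics down as y = x0 * sigmoid(x0) * x1 followed by triton_quantize_fp8_row. A rough eager-mode sketch of that math, assuming torch.float8_e4m3fn is available and using 448.0 as that format's finite maximum; this illustrates the computation only, and the exact scale handling inside triton_quantize_fp8_row may differ:

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
    fp8_max: float = 448.0,  # finite max of float8_e4m3fn
) -> tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated product in fp32, matching the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Per-row dequantization scale; the test dequantizes with
    # y_fp8.to(torch.float32) * y_scale[:, None].
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale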
2025-05-07T20:32:29.4557703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4557864Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4558240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4558469Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4558863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4558960Z kernel = self.compile( 2025-05-07T20:32:29.4559354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4559541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4559669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4559673Z 2025-05-07T20:32:29.4559880Z self = 2025-05-07T20:32:29.4560746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4561272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e737240>} 2025-05-07T20:32:29.4562044Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4562236Z context = 2025-05-07T20:32:29.4562241Z 2025-05-07T20:32:29.4562406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4562680Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4562788Z module_map=module_map) 2025-05-07T20:32:29.4562955Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4563052Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4563130Z E ^ 2025-05-07T20:32:29.4563501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4563505Z 2025-05-07T20:32:29.4563956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4563960Z 2025-05-07T20:32:29.4564082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4564315Z self=, 2025-05-07T20:32:29.4564392Z T=1, 2025-05-07T20:32:29.4564474Z D=7168, 2025-05-07T20:32:29.4567836Z scale_ub=None, 2025-05-07T20:32:29.4567948Z contiguous=False, 2025-05-07T20:32:29.4568033Z compiled=True, 2025-05-07T20:32:29.4568107Z ) 2025-05-07T20:32:29.4568338Z self = 2025-05-07T20:32:29.4568505Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4568513Z 2025-05-07T20:32:29.4568596Z @given( 2025-05-07T20:32:29.4568816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4568918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4569035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4569149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4569263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4569341Z ) 2025-05-07T20:32:29.4569596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4569695Z def test_silu_mul_quant( 2025-05-07T20:32:29.4569835Z self, 2025-05-07T20:32:29.4569911Z T: int, 2025-05-07T20:32:29.4569987Z D: int, 2025-05-07T20:32:29.4570083Z scale_ub: Optional[float], 2025-05-07T20:32:29.4570171Z contiguous: bool, 2025-05-07T20:32:29.4570259Z compiled: bool, 2025-05-07T20:32:29.4570337Z ) -> None: 2025-05-07T20:32:29.4570473Z torch.manual_seed(2025) 2025-05-07T20:32:29.4570548Z 2025-05-07T20:32:29.4570724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4570798Z 2025-05-07T20:32:29.4570893Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4571020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4571111Z x = x_sign * x_clamp 2025-05-07T20:32:29.4571190Z x0 = x[:, :D] 2025-05-07T20:32:29.4571267Z x1 = x[:, D:] 2025-05-07T20:32:29.4571342Z 2025-05-07T20:32:29.4571426Z if contiguous: 2025-05-07T20:32:29.4571515Z x0 = x0.contiguous() 2025-05-07T20:32:29.4571609Z x1 = x1.contiguous() 2025-05-07T20:32:29.4571680Z 2025-05-07T20:32:29.4571769Z if scale_ub is not None: 2025-05-07T20:32:29.4571879Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4572014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4572089Z ) 2025-05-07T20:32:29.4572173Z else: 2025-05-07T20:32:29.4572269Z scale_ub_tensor = None 2025-05-07T20:32:29.4572340Z 2025-05-07T20:32:29.4572475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4572564Z op = silu_mul_quant 2025-05-07T20:32:29.4572655Z if compiled: 2025-05-07T20:32:29.4572754Z op = torch.compile(op) 2025-05-07T20:32:29.4572858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4572931Z 2025-05-07T20:32:29.4573020Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4573139Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4573218Z 2025-05-07T20:32:29.4573354Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4573456Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4573558Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4573681Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4573828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4573903Z 2025-05-07T20:32:29.4574003Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.4574007Z 2025-05-07T20:32:29.4574107Z moe/activation_test.py:126: 2025-05-07T20:32:29.4574236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4574340Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4574477Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4575052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4575156Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4575524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4575750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4576213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4576477Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4576864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4577037Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4577387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4577503Z fn() 2025-05-07T20:32:29.4577913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4577996Z self.fn.run( 2025-05-07T20:32:29.4578348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4578479Z kernel = self.compile( 2025-05-07T20:32:29.4578876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4579059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4579185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4579190Z 2025-05-07T20:32:29.4579400Z self = 2025-05-07T20:32:29.4580198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4580717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e736980>} 2025-05-07T20:32:29.4581490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4581685Z context = 2025-05-07T20:32:29.4581689Z 2025-05-07T20:32:29.4581858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4582128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4582238Z module_map=module_map) 2025-05-07T20:32:29.4582403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4582505Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4582586Z E ^ 2025-05-07T20:32:29.4582947Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4582955Z 2025-05-07T20:32:29.4583385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4583395Z 2025-05-07T20:32:29.4583499Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4583726Z self=, 2025-05-07T20:32:29.4583806Z T=1, 2025-05-07T20:32:29.4583881Z D=5120, 2025-05-07T20:32:29.4583964Z scale_ub=1200.0, 2025-05-07T20:32:29.4584051Z contiguous=False, 2025-05-07T20:32:29.4584133Z compiled=True, 2025-05-07T20:32:29.4584203Z ) 2025-05-07T20:32:29.4584429Z self = 2025-05-07T20:32:29.4584600Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4584605Z 2025-05-07T20:32:29.4584684Z @given( 2025-05-07T20:32:29.4584803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4584904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4585100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4585219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4585335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4585410Z ) 2025-05-07T20:32:29.4585661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4585754Z def test_silu_mul_quant( 2025-05-07T20:32:29.4585833Z self, 2025-05-07T20:32:29.4585908Z T: int, 2025-05-07T20:32:29.4585982Z D: int, 2025-05-07T20:32:29.4586083Z scale_ub: Optional[float], 2025-05-07T20:32:29.4586210Z contiguous: bool, 2025-05-07T20:32:29.4586300Z compiled: bool, 2025-05-07T20:32:29.4586378Z ) -> None: 2025-05-07T20:32:29.4586474Z torch.manual_seed(2025) 2025-05-07T20:32:29.4586549Z 2025-05-07T20:32:29.4586720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4586834Z 2025-05-07T20:32:29.4586938Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4587063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4587152Z x = x_sign * x_clamp 2025-05-07T20:32:29.4587233Z x0 = x[:, :D] 2025-05-07T20:32:29.4587311Z x1 = x[:, D:] 2025-05-07T20:32:29.4587383Z 2025-05-07T20:32:29.4587467Z if contiguous: 2025-05-07T20:32:29.4587558Z x0 = x0.contiguous() 2025-05-07T20:32:29.4587650Z x1 = x1.contiguous() 2025-05-07T20:32:29.4587721Z 2025-05-07T20:32:29.4587809Z if scale_ub is not None: 2025-05-07T20:32:29.4587918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4588054Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4588126Z ) 2025-05-07T20:32:29.4588204Z else: 2025-05-07T20:32:29.4588296Z scale_ub_tensor = None 2025-05-07T20:32:29.4588368Z 2025-05-07T20:32:29.4588504Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4588597Z op = silu_mul_quant 2025-05-07T20:32:29.4588682Z if compiled: 2025-05-07T20:32:29.4588783Z op = torch.compile(op) 2025-05-07T20:32:29.4588889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4588958Z 2025-05-07T20:32:29.4589051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4589055Z 2025-05-07T20:32:29.4589152Z moe/activation_test.py:117: 2025-05-07T20:32:29.4589283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4589382Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4589487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4589862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4589954Z return fn(*args, **kwargs) 
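[annotation] Note that the T=1, D=7168, scale_ub=None, compiled=True draw above fails one step later than the others: fn() returns, and it is the reference path, triton_quantize_fp8_row, that raises the identical fp8e4nv error from inside the autotuner's benchmarking loop (autotuner.py -> do_bench -> compile). Both the fused kernel and the reference quantizer therefore need the same hardware, so gating the whole test on device capability is the natural fix. One way to express that, as a sketch (unittest-style decorator; names are illustrative):

import unittest
import torch

def _cuda_capability() -> tuple[int, int]:
    if not torch.cuda.is_available():
        return (0, 0)
    return torch.cuda.get_device_capability()

@unittest.skipIf(_cuda_capability() < (8, 9), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8ActivationTest(unittest.TestCase):
    def test_smoke(self) -> None:
        # Real fp8 cases would live here; skipped wholesale on older GPUs.
        self.assertTrue(True)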
2025-05-07T20:32:29.4590463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4590567Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4590931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4591162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4591511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4591606Z kernel = self.compile( 2025-05-07T20:32:29.4592004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4592183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4592312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4592316Z 2025-05-07T20:32:29.4592524Z self = 2025-05-07T20:32:29.4593426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4593947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a45d00>} 2025-05-07T20:32:29.4594715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4594951Z context = 2025-05-07T20:32:29.4594955Z 2025-05-07T20:32:29.4595122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4595434Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4595546Z module_map=module_map) 2025-05-07T20:32:29.4595708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4595807Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4595882Z E ^ 2025-05-07T20:32:29.4596246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4596250Z 2025-05-07T20:32:29.4596678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4596685Z 2025-05-07T20:32:29.4596789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4597017Z self=, 2025-05-07T20:32:29.4597092Z T=1, 2025-05-07T20:32:29.4597165Z D=5120, 2025-05-07T20:32:29.4597255Z scale_ub=1200.0, 2025-05-07T20:32:29.4597340Z contiguous=False, 2025-05-07T20:32:29.4597428Z compiled=False, 2025-05-07T20:32:29.4597504Z ) 2025-05-07T20:32:29.4597724Z self = 2025-05-07T20:32:29.4597894Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4597898Z 2025-05-07T20:32:29.4597975Z @given( 2025-05-07T20:32:29.4598094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4598196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4598309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4598430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4598546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4598619Z ) 2025-05-07T20:32:29.4598870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4598967Z def test_silu_mul_quant( 2025-05-07T20:32:29.4599046Z self, 2025-05-07T20:32:29.4599125Z T: int, 2025-05-07T20:32:29.4599203Z D: int, 2025-05-07T20:32:29.4599301Z scale_ub: Optional[float], 2025-05-07T20:32:29.4599389Z contiguous: bool, 2025-05-07T20:32:29.4599475Z compiled: bool, 2025-05-07T20:32:29.4599552Z ) -> None: 2025-05-07T20:32:29.4599647Z torch.manual_seed(2025) 2025-05-07T20:32:29.4599717Z 2025-05-07T20:32:29.4599888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4599965Z 2025-05-07T20:32:29.4600057Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4600249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4600340Z x = x_sign * x_clamp 2025-05-07T20:32:29.4600419Z x0 = x[:, :D] 2025-05-07T20:32:29.4600498Z x1 = x[:, D:] 2025-05-07T20:32:29.4600573Z 2025-05-07T20:32:29.4600655Z if contiguous: 2025-05-07T20:32:29.4600747Z x0 = x0.contiguous() 2025-05-07T20:32:29.4600839Z x1 = x1.contiguous() 2025-05-07T20:32:29.4601033Z 2025-05-07T20:32:29.4601126Z if scale_ub is not None: 2025-05-07T20:32:29.4601236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4601372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4601448Z ) 2025-05-07T20:32:29.4601523Z else: 2025-05-07T20:32:29.4601618Z scale_ub_tensor = None 2025-05-07T20:32:29.4601690Z 2025-05-07T20:32:29.4601819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4601908Z op = silu_mul_quant 2025-05-07T20:32:29.4602035Z if compiled: 2025-05-07T20:32:29.4602135Z op = torch.compile(op) 2025-05-07T20:32:29.4602240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4602315Z 2025-05-07T20:32:29.4602403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4602407Z 2025-05-07T20:32:29.4602546Z moe/activation_test.py:117: 2025-05-07T20:32:29.4602678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4602777Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4602879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4603391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4603487Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4603860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4604089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4604439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4604532Z kernel = self.compile( 2025-05-07T20:32:29.4604924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4605110Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4605236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4605241Z 2025-05-07T20:32:29.4605446Z self = 2025-05-07T20:32:29.4606247Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4606766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a3f240>} 2025-05-07T20:32:29.4607544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4607742Z context = 2025-05-07T20:32:29.4607746Z 2025-05-07T20:32:29.4607917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4608189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4608295Z module_map=module_map) 2025-05-07T20:32:29.4608462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4608561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4608639Z E ^ 2025-05-07T20:32:29.4609007Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4609012Z 2025-05-07T20:32:29.4609438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4609444Z 2025-05-07T20:32:29.4609630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4609860Z self=, 2025-05-07T20:32:29.4609936Z T=16384, 2025-05-07T20:32:29.4610014Z D=5120, 2025-05-07T20:32:29.4610096Z scale_ub=1200.0, 2025-05-07T20:32:29.4610181Z contiguous=False, 2025-05-07T20:32:29.4610266Z compiled=True, 2025-05-07T20:32:29.4610337Z ) 2025-05-07T20:32:29.4610563Z self = 2025-05-07T20:32:29.4610745Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4610788Z 2025-05-07T20:32:29.4610863Z @given( 2025-05-07T20:32:29.4610986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4611085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4611198Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4611360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4611479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4611550Z ) 2025-05-07T20:32:29.4611805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4611897Z def test_silu_mul_quant( 2025-05-07T20:32:29.4611975Z self, 2025-05-07T20:32:29.4612049Z T: int, 2025-05-07T20:32:29.4612123Z D: int, 2025-05-07T20:32:29.4612221Z scale_ub: Optional[float], 2025-05-07T20:32:29.4612309Z contiguous: bool, 2025-05-07T20:32:29.4612395Z compiled: bool, 2025-05-07T20:32:29.4612478Z ) -> None: 2025-05-07T20:32:29.4612572Z torch.manual_seed(2025) 2025-05-07T20:32:29.4612643Z 2025-05-07T20:32:29.4612817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4612889Z 2025-05-07T20:32:29.4612981Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4613112Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4613207Z x = x_sign * x_clamp 2025-05-07T20:32:29.4613288Z x0 = x[:, :D] 2025-05-07T20:32:29.4613617Z x1 = x[:, D:] 2025-05-07T20:32:29.4613724Z 2025-05-07T20:32:29.4613816Z if contiguous: 2025-05-07T20:32:29.4613908Z x0 = x0.contiguous() 2025-05-07T20:32:29.4613995Z x1 = x1.contiguous() 2025-05-07T20:32:29.4614069Z 2025-05-07T20:32:29.4614160Z if scale_ub is not None: 2025-05-07T20:32:29.4614265Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4614404Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4614483Z ) 2025-05-07T20:32:29.4614558Z else: 2025-05-07T20:32:29.4614654Z scale_ub_tensor = None 2025-05-07T20:32:29.4614726Z 2025-05-07T20:32:29.4614861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4614952Z op = silu_mul_quant 2025-05-07T20:32:29.4615038Z if compiled: 2025-05-07T20:32:29.4615144Z op = torch.compile(op) 2025-05-07T20:32:29.4615250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4615319Z 2025-05-07T20:32:29.4615410Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4615415Z 2025-05-07T20:32:29.4615513Z moe/activation_test.py:117: 2025-05-07T20:32:29.4615646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4615747Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4615846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4616225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4616320Z return fn(*args, **kwargs) 
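[annotation] Each "Trying example" block in this log is one draw by Hypothesis from the 5 x 2 x 2 x 2 x 2 = 80-combination grid defined by the sampled_from strategies; the draws are printed because the test runs at Verbosity.verbose, deadline=None disables Hypothesis's per-example time limit, and max_examples=_MAX_SAMPLES caps how many draws are attempted. To turn one logged draw into a deterministic regression case, hypothesis.example can pin it, as in this stand-alone sketch with a stand-in test body:

from hypothesis import Verbosity, example, given, settings, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=128)  # pin the first failing draw seen in this log
@settings(verbosity=Verbosity.verbose, deadline=None, max_examples=10)
def test_sketch(T: int) -> None:
    assert T >= 1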
2025-05-07T20:32:29.4616827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4616930Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4617435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4617670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4618017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4618110Z kernel = self.compile( 2025-05-07T20:32:29.4618508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4618686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4618871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4618878Z 2025-05-07T20:32:29.4619086Z self = 2025-05-07T20:32:29.4619889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4620464Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a3dd00>} 2025-05-07T20:32:29.4621226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4621425Z context = 2025-05-07T20:32:29.4621429Z 2025-05-07T20:32:29.4621596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4621866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4621978Z module_map=module_map) 2025-05-07T20:32:29.4622144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4622245Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4622319Z E ^ 2025-05-07T20:32:29.4622683Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4622687Z 2025-05-07T20:32:29.4623115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4623120Z 2025-05-07T20:32:29.4623222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4623449Z self=, 2025-05-07T20:32:29.4623528Z T=2048, 2025-05-07T20:32:29.4623602Z D=7168, 2025-05-07T20:32:29.4623687Z scale_ub=1200.0, 2025-05-07T20:32:29.4623772Z contiguous=False, 2025-05-07T20:32:29.4623853Z compiled=True, 2025-05-07T20:32:29.4623930Z ) 2025-05-07T20:32:29.4624157Z self = 2025-05-07T20:32:29.4624335Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4624340Z 2025-05-07T20:32:29.4624419Z @given( 2025-05-07T20:32:29.4624540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4624637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4624755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4624871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4624989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4625063Z ) 2025-05-07T20:32:29.4625315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4625409Z def test_silu_mul_quant( 2025-05-07T20:32:29.4625484Z self, 2025-05-07T20:32:29.4625557Z T: int, 2025-05-07T20:32:29.4625638Z D: int, 2025-05-07T20:32:29.4625735Z scale_ub: Optional[float], 2025-05-07T20:32:29.4625929Z contiguous: bool, 2025-05-07T20:32:29.4626019Z compiled: bool, 2025-05-07T20:32:29.4626095Z ) -> None: 2025-05-07T20:32:29.4626188Z torch.manual_seed(2025) 2025-05-07T20:32:29.4626262Z 2025-05-07T20:32:29.4626434Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4626508Z 2025-05-07T20:32:29.4626599Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4626722Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4626816Z x = x_sign * x_clamp 2025-05-07T20:32:29.4626935Z x0 = x[:, :D] 2025-05-07T20:32:29.4627013Z x1 = x[:, D:] 2025-05-07T20:32:29.4627091Z 2025-05-07T20:32:29.4627174Z if contiguous: 2025-05-07T20:32:29.4627271Z x0 = x0.contiguous() 2025-05-07T20:32:29.4627358Z x1 = x1.contiguous() 2025-05-07T20:32:29.4627428Z 2025-05-07T20:32:29.4627563Z if scale_ub is not None: 2025-05-07T20:32:29.4627675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4627813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4627887Z ) 2025-05-07T20:32:29.4627962Z else: 2025-05-07T20:32:29.4628057Z scale_ub_tensor = None 2025-05-07T20:32:29.4628127Z 2025-05-07T20:32:29.4628257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4628349Z op = silu_mul_quant 2025-05-07T20:32:29.4628430Z if compiled: 2025-05-07T20:32:29.4628528Z op = torch.compile(op) 2025-05-07T20:32:29.4628637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4628706Z 2025-05-07T20:32:29.4628796Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4628804Z 2025-05-07T20:32:29.4628902Z moe/activation_test.py:117: 2025-05-07T20:32:29.4629027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4629132Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4629233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4629610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4629704Z return fn(*args, **kwargs) 
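[annotation] The compiled=True and compiled=False draws fail identically; the only difference in their tracebacks is the extra torch/_dynamo/eval_frame.py frame, because torch.compile wraps the call before handing it to the very same Triton launch. Dynamo does not change which kernel gets built. A minimal illustration of that wrapping (op is a stand-in, not silu_mul_quant):

import torch

def op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the eager op; torch.compile only adds a wrapper frame.
    return torch.nn.functional.silu(x) * x

compiled_op = torch.compile(op)
# eager:    op(x)          -> kernel launch
# compiled: compiled_op(x) -> _dynamo/eval_frame.py:_fn -> op(x) -> same launch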
2025-05-07T20:32:29.4630214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4630310Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4630681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4630911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4631264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4631356Z kernel = self.compile( 2025-05-07T20:32:29.4631758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4631938Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4632062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4632067Z 2025-05-07T20:32:29.4632274Z self = 2025-05-07T20:32:29.4633068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4633587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a3fc40>} 2025-05-07T20:32:29.4634433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4634630Z context = 2025-05-07T20:32:29.4634634Z 2025-05-07T20:32:29.4634803Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4635070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4635173Z module_map=module_map) 2025-05-07T20:32:29.4635336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4635471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4635547Z E ^ 2025-05-07T20:32:29.4635909Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4635913Z 2025-05-07T20:32:29.4636380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4636385Z 2025-05-07T20:32:29.4636490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4636718Z self=, 2025-05-07T20:32:29.4636798Z T=1, 2025-05-07T20:32:29.4636872Z D=5120, 2025-05-07T20:32:29.4636953Z scale_ub=None, 2025-05-07T20:32:29.4637039Z contiguous=False, 2025-05-07T20:32:29.4637122Z compiled=False, 2025-05-07T20:32:29.4637192Z ) 2025-05-07T20:32:29.4637414Z self = 2025-05-07T20:32:29.4637583Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.4637587Z 2025-05-07T20:32:29.4637662Z @given( 2025-05-07T20:32:29.4637783Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4637881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4637998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4638123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4638233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4638308Z ) 2025-05-07T20:32:29.4638554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4638649Z def test_silu_mul_quant( 2025-05-07T20:32:29.4638725Z self, 2025-05-07T20:32:29.4638799Z T: int, 2025-05-07T20:32:29.4638873Z D: int, 2025-05-07T20:32:29.4638974Z scale_ub: Optional[float], 2025-05-07T20:32:29.4639062Z contiguous: bool, 2025-05-07T20:32:29.4639148Z compiled: bool, 2025-05-07T20:32:29.4639225Z ) -> None: 2025-05-07T20:32:29.4639317Z torch.manual_seed(2025) 2025-05-07T20:32:29.4639387Z 2025-05-07T20:32:29.4639557Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4639627Z 2025-05-07T20:32:29.4639724Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4639849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4639936Z x = x_sign * x_clamp 2025-05-07T20:32:29.4640018Z x0 = x[:, :D] 2025-05-07T20:32:29.4640096Z x1 = x[:, D:] 2025-05-07T20:32:29.4640238Z 2025-05-07T20:32:29.4640324Z if contiguous: 2025-05-07T20:32:29.4640413Z x0 = x0.contiguous() 2025-05-07T20:32:29.4640499Z x1 = x1.contiguous() 2025-05-07T20:32:29.4640572Z 2025-05-07T20:32:29.4640659Z if scale_ub is not None: 2025-05-07T20:32:29.4640764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4640901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4640973Z ) 2025-05-07T20:32:29.4641051Z else: 2025-05-07T20:32:29.4641142Z scale_ub_tensor = None 2025-05-07T20:32:29.4641212Z 2025-05-07T20:32:29.4641344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4641435Z op = silu_mul_quant 2025-05-07T20:32:29.4641603Z if compiled: 2025-05-07T20:32:29.4641706Z op = torch.compile(op) 2025-05-07T20:32:29.4641809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4641878Z 2025-05-07T20:32:29.4641971Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4641975Z 2025-05-07T20:32:29.4642071Z moe/activation_test.py:117: 2025-05-07T20:32:29.4642200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4642299Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4642395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4642946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4643043Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4643407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4643676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4644023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4644118Z kernel = self.compile( 2025-05-07T20:32:29.4644509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4644685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4644815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4644821Z 2025-05-07T20:32:29.4645027Z self = 2025-05-07T20:32:29.4645824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4646345Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f96afc40>} 2025-05-07T20:32:29.4647104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4647299Z context = 2025-05-07T20:32:29.4647304Z 2025-05-07T20:32:29.4647471Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4647740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4647846Z module_map=module_map) 2025-05-07T20:32:29.4648007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4648111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4648190Z E ^ 2025-05-07T20:32:29.4648550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4648560Z 2025-05-07T20:32:29.4648984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4648989Z 2025-05-07T20:32:29.4649091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4649317Z self=, 2025-05-07T20:32:29.4649399Z T=4096, 2025-05-07T20:32:29.4649474Z D=7168, 2025-05-07T20:32:29.4649557Z scale_ub=1200.0, 2025-05-07T20:32:29.4649641Z contiguous=False, 2025-05-07T20:32:29.4649722Z compiled=False, 2025-05-07T20:32:29.4649796Z ) 2025-05-07T20:32:29.4650016Z self = 2025-05-07T20:32:29.4650280Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4650285Z 2025-05-07T20:32:29.4650360Z @given( 2025-05-07T20:32:29.4650478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4650578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4650690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4650805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4650920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4650991Z ) 2025-05-07T20:32:29.4651240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4651376Z def test_silu_mul_quant( 2025-05-07T20:32:29.4651449Z self, 2025-05-07T20:32:29.4651525Z T: int, 2025-05-07T20:32:29.4651600Z D: int, 2025-05-07T20:32:29.4651696Z scale_ub: Optional[float], 2025-05-07T20:32:29.4651786Z contiguous: bool, 2025-05-07T20:32:29.4651932Z compiled: bool, 2025-05-07T20:32:29.4652012Z ) -> None: 2025-05-07T20:32:29.4652111Z torch.manual_seed(2025) 2025-05-07T20:32:29.4652181Z 2025-05-07T20:32:29.4652352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4652427Z 2025-05-07T20:32:29.4652516Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4652640Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4652730Z x = x_sign * x_clamp 2025-05-07T20:32:29.4652808Z x0 = x[:, :D] 2025-05-07T20:32:29.4652890Z x1 = x[:, D:] 2025-05-07T20:32:29.4652959Z 2025-05-07T20:32:29.4653041Z if contiguous: 2025-05-07T20:32:29.4653134Z x0 = x0.contiguous() 2025-05-07T20:32:29.4653220Z x1 = x1.contiguous() 2025-05-07T20:32:29.4653288Z 2025-05-07T20:32:29.4653378Z if scale_ub is not None: 2025-05-07T20:32:29.4653482Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4653618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4653698Z ) 2025-05-07T20:32:29.4653773Z else: 2025-05-07T20:32:29.4653867Z scale_ub_tensor = None 2025-05-07T20:32:29.4653938Z 2025-05-07T20:32:29.4654066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4654153Z op = silu_mul_quant 2025-05-07T20:32:29.4654237Z if compiled: 2025-05-07T20:32:29.4654334Z op = torch.compile(op) 2025-05-07T20:32:29.4654439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4654509Z 2025-05-07T20:32:29.4654600Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4654604Z 2025-05-07T20:32:29.4654705Z moe/activation_test.py:117: 2025-05-07T20:32:29.4654829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4654929Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4655030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4655550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:29.4655649Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4656014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4656240Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4656590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4656682Z kernel = self.compile( 2025-05-07T20:32:29.4657075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4657256Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4657380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4657386Z 2025-05-07T20:32:29.4657674Z self = 2025-05-07T20:32:29.4658470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4658986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f8b0ad40>} 2025-05-07T20:32:29.4659757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4659987Z context = 2025-05-07T20:32:29.4659992Z 2025-05-07T20:32:29.4660199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4660471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4660580Z module_map=module_map) 2025-05-07T20:32:29.4660741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4660839Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4660915Z E ^ 2025-05-07T20:32:29.4661277Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4661282Z 2025-05-07T20:32:29.4661707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4661712Z 2025-05-07T20:32:29.4661815Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4662040Z self=, 2025-05-07T20:32:29.4662123Z T=16384, 2025-05-07T20:32:29.4662199Z D=7168, 2025-05-07T20:32:29.4662284Z scale_ub=None, 2025-05-07T20:32:29.4662372Z contiguous=True, 2025-05-07T20:32:29.4662452Z compiled=True, 2025-05-07T20:32:29.4662522Z ) 2025-05-07T20:32:29.4662745Z self = 2025-05-07T20:32:29.4662920Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.4662925Z 2025-05-07T20:32:29.4662999Z @given( 2025-05-07T20:32:29.4663120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4663216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4663336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4663451Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4663565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4663639Z ) 2025-05-07T20:32:29.4663892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4663995Z def test_silu_mul_quant( 2025-05-07T20:32:29.4664093Z self, 2025-05-07T20:32:29.4664169Z T: int, 2025-05-07T20:32:29.4664261Z D: int, 2025-05-07T20:32:29.4664363Z scale_ub: Optional[float], 2025-05-07T20:32:29.4664450Z contiguous: bool, 2025-05-07T20:32:29.4664534Z compiled: bool, 2025-05-07T20:32:29.4664612Z ) -> None: 2025-05-07T20:32:29.4664705Z torch.manual_seed(2025) 2025-05-07T20:32:29.4664777Z 2025-05-07T20:32:29.4664944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4665018Z 2025-05-07T20:32:29.4665109Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4665230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4665316Z x = x_sign * x_clamp 2025-05-07T20:32:29.4665396Z x0 = x[:, :D] 2025-05-07T20:32:29.4665474Z x1 = x[:, D:] 2025-05-07T20:32:29.4665544Z 2025-05-07T20:32:29.4665628Z if contiguous: 2025-05-07T20:32:29.4665798Z x0 = x0.contiguous() 2025-05-07T20:32:29.4665886Z x1 = x1.contiguous() 2025-05-07T20:32:29.4665958Z 2025-05-07T20:32:29.4666045Z if scale_ub is not None: 2025-05-07T20:32:29.4666151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4666285Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4666359Z ) 2025-05-07T20:32:29.4666435Z else: 2025-05-07T20:32:29.4666527Z scale_ub_tensor = None 2025-05-07T20:32:29.4666598Z 2025-05-07T20:32:29.4666729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4666857Z op = silu_mul_quant 2025-05-07T20:32:29.4666939Z if compiled: 2025-05-07T20:32:29.4667039Z op = torch.compile(op) 2025-05-07T20:32:29.4667142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4667211Z 2025-05-07T20:32:29.4667340Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4667344Z 2025-05-07T20:32:29.4667443Z moe/activation_test.py:117: 2025-05-07T20:32:29.4667573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4667671Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4667767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4668146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4668237Z return fn(*args, **kwargs) 
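[annotation] For orientation, Triton's fp8 names map onto PyTorch dtypes roughly as follows: 'fp8e4nv' is torch.float8_e4m3fn (4 exponent / 3 mantissa bits, NVIDIA variant), 'fp8e5' is torch.float8_e5m2, and 'fp8e4b15' is a Triton-internal e4m3 format with exponent bias 15 that has no direct torch equivalent. Per the error text, only the latter two compile on this GPU. A hedged helper that picks a compilable fp8 dtype (illustrative; whether FBGEMM's kernels accept e5m2 here is a separate question):

import torch

def pick_fp8_dtype() -> torch.dtype:
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn  # Triton 'fp8e4nv'
    return torch.float8_e5m2        # Triton 'fp8e5', per the error above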
2025-05-07T20:32:29.4668744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4668847Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4669212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4669441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4669797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4669889Z kernel = self.compile( 2025-05-07T20:32:29.4670288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4670463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4670587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4670594Z 2025-05-07T20:32:29.4670799Z self = 2025-05-07T20:32:29.4671600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4672126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f8b0b920>} 2025-05-07T20:32:29.4672893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4673090Z context = 2025-05-07T20:32:29.4673094Z 2025-05-07T20:32:29.4673257Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4673526Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4673636Z module_map=module_map) 2025-05-07T20:32:29.4673796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4673896Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4673972Z E ^ 2025-05-07T20:32:29.4674465Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tried the following examples. Every one of them ran the identical test body shown above and failed with the same triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised from /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100; only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4825513Z 2025-05-07T20:32:29.4825947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4825951Z 2025-05-07T20:32:29.4826061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4826290Z self=, 2025-05-07T20:32:29.4826373Z T=16384, 2025-05-07T20:32:29.4826450Z D=5120, 2025-05-07T20:32:29.4826533Z scale_ub=None, 2025-05-07T20:32:29.4826627Z contiguous=False, 2025-05-07T20:32:29.4826711Z compiled=True, 2025-05-07T20:32:29.4826785Z ) 2025-05-07T20:32:29.4827012Z self = 2025-05-07T20:32:29.4827192Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4827197Z 2025-05-07T20:32:29.4827278Z @given( 2025-05-07T20:32:29.4827400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4827584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4827706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4827824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4827938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4828017Z ) 2025-05-07T20:32:29.4828272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4828365Z def test_silu_mul_quant( 2025-05-07T20:32:29.4828445Z self, 2025-05-07T20:32:29.4828522Z T: int, 2025-05-07T20:32:29.4828601Z D: int, 2025-05-07T20:32:29.4828741Z scale_ub: Optional[float], 2025-05-07T20:32:29.4828831Z contiguous: bool, 2025-05-07T20:32:29.4828919Z compiled: bool, 2025-05-07T20:32:29.4828998Z ) -> None: 2025-05-07T20:32:29.4829094Z torch.manual_seed(2025) 2025-05-07T20:32:29.4829170Z 2025-05-07T20:32:29.4829382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4829461Z 2025-05-07T20:32:29.4829559Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4829686Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4829777Z x = x_sign * x_clamp 2025-05-07T20:32:29.4829861Z x0 = x[:, :D] 2025-05-07T20:32:29.4829940Z x1 = x[:, D:] 2025-05-07T20:32:29.4830014Z 2025-05-07T20:32:29.4830099Z if contiguous: 2025-05-07T20:32:29.4830190Z x0 = x0.contiguous() 2025-05-07T20:32:29.4830284Z x1 = x1.contiguous() 2025-05-07T20:32:29.4830355Z 2025-05-07T20:32:29.4830450Z if scale_ub is not None: 2025-05-07T20:32:29.4830559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4830695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4830770Z ) 2025-05-07T20:32:29.4830851Z else: 2025-05-07T20:32:29.4830945Z scale_ub_tensor = None 2025-05-07T20:32:29.4831021Z 2025-05-07T20:32:29.4831164Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4831257Z op = silu_mul_quant 2025-05-07T20:32:29.4831341Z if compiled: 2025-05-07T20:32:29.4831445Z op = torch.compile(op) 2025-05-07T20:32:29.4831553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4831628Z 2025-05-07T20:32:29.4831720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4831724Z 2025-05-07T20:32:29.4831825Z moe/activation_test.py:117: 2025-05-07T20:32:29.4831959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4832064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4832163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4832544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4832638Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4833159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4833259Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4833626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4833858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4834255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4834360Z kernel = self.compile( 2025-05-07T20:32:29.4834758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4834939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4835071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4835077Z 2025-05-07T20:32:29.4835365Z self = 2025-05-07T20:32:29.4836168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4836690Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f8fdb060>} 2025-05-07T20:32:29.4837459Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4837699Z context = 2025-05-07T20:32:29.4837703Z 2025-05-07T20:32:29.4837873Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4838191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4838301Z module_map=module_map) 2025-05-07T20:32:29.4838465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4838567Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4838644Z E ^ 2025-05-07T20:32:29.4839009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4839013Z 2025-05-07T20:32:29.4839450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4839456Z 2025-05-07T20:32:29.4839566Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4839799Z self=, 2025-05-07T20:32:29.4839878Z T=2048, 2025-05-07T20:32:29.4839958Z D=5120, 2025-05-07T20:32:29.4840049Z scale_ub=None, 2025-05-07T20:32:29.4840220Z contiguous=False, 2025-05-07T20:32:29.4840304Z compiled=True, 2025-05-07T20:32:29.4840380Z ) 2025-05-07T20:32:29.4840604Z self = 2025-05-07T20:32:29.4840782Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4840790Z 2025-05-07T20:32:29.4840868Z @given( 2025-05-07T20:32:29.4840990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4841094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4841213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4841334Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4841453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4841527Z ) 2025-05-07T20:32:29.4841780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4841881Z def test_silu_mul_quant( 2025-05-07T20:32:29.4841963Z self, 2025-05-07T20:32:29.4842040Z T: int, 2025-05-07T20:32:29.4842118Z D: int, 2025-05-07T20:32:29.4842217Z scale_ub: Optional[float], 2025-05-07T20:32:29.4842312Z contiguous: bool, 2025-05-07T20:32:29.4842399Z compiled: bool, 2025-05-07T20:32:29.4842476Z ) -> None: 2025-05-07T20:32:29.4842575Z torch.manual_seed(2025) 2025-05-07T20:32:29.4842647Z 2025-05-07T20:32:29.4842820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4842899Z 2025-05-07T20:32:29.4842991Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4843119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4843210Z x = x_sign * x_clamp 2025-05-07T20:32:29.4843291Z x0 = x[:, :D] 2025-05-07T20:32:29.4843370Z x1 = x[:, D:] 2025-05-07T20:32:29.4843445Z 2025-05-07T20:32:29.4843532Z if contiguous: 2025-05-07T20:32:29.4843626Z x0 = x0.contiguous() 2025-05-07T20:32:29.4843827Z x1 = x1.contiguous() 2025-05-07T20:32:29.4843899Z 2025-05-07T20:32:29.4843994Z if scale_ub is not None: 2025-05-07T20:32:29.4844102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4844238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4844318Z ) 2025-05-07T20:32:29.4844394Z else: 2025-05-07T20:32:29.4844490Z scale_ub_tensor = None 2025-05-07T20:32:29.4844566Z 2025-05-07T20:32:29.4844699Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4844829Z op = silu_mul_quant 2025-05-07T20:32:29.4844918Z if compiled: 2025-05-07T20:32:29.4845018Z op = torch.compile(op) 2025-05-07T20:32:29.4845125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4845200Z 2025-05-07T20:32:29.4845290Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4845333Z 2025-05-07T20:32:29.4845434Z moe/activation_test.py:117: 2025-05-07T20:32:29.4845568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4845669Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4845771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4846151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4846247Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4846760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4846861Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4847230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4847462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4847823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4847921Z kernel = self.compile( 2025-05-07T20:32:29.4848316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4848501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4848632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4848637Z 2025-05-07T20:32:29.4848848Z self = 2025-05-07T20:32:29.4849653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4850178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1c7c0>} 2025-05-07T20:32:29.4850949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4851146Z context = 2025-05-07T20:32:29.4851151Z 2025-05-07T20:32:29.4851320Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4851597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4851708Z module_map=module_map) 2025-05-07T20:32:29.4851874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4851974Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4852050Z E ^ 2025-05-07T20:32:29.4852499Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4852507Z 2025-05-07T20:32:29.4852936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4852941Z 2025-05-07T20:32:29.4853049Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4853278Z self=, 2025-05-07T20:32:29.4853355Z T=2048, 2025-05-07T20:32:29.4853434Z D=5120, 2025-05-07T20:32:29.4853518Z scale_ub=1200.0, 2025-05-07T20:32:29.4853605Z contiguous=False, 2025-05-07T20:32:29.4853731Z compiled=True, 2025-05-07T20:32:29.4853804Z ) 2025-05-07T20:32:29.4854026Z self = 2025-05-07T20:32:29.4854211Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4854216Z 2025-05-07T20:32:29.4854337Z @given( 2025-05-07T20:32:29.4854467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4854568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4854685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4854806Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4854921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4854996Z ) 2025-05-07T20:32:29.4855253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4855349Z def test_silu_mul_quant( 2025-05-07T20:32:29.4855426Z self, 2025-05-07T20:32:29.4855510Z T: int, 2025-05-07T20:32:29.4855586Z D: int, 2025-05-07T20:32:29.4855684Z scale_ub: Optional[float], 2025-05-07T20:32:29.4855778Z contiguous: bool, 2025-05-07T20:32:29.4855864Z compiled: bool, 2025-05-07T20:32:29.4855947Z ) -> None: 2025-05-07T20:32:29.4856043Z torch.manual_seed(2025) 2025-05-07T20:32:29.4856121Z 2025-05-07T20:32:29.4856301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4856376Z 2025-05-07T20:32:29.4856467Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4856594Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4856683Z x = x_sign * x_clamp 2025-05-07T20:32:29.4856763Z x0 = x[:, :D] 2025-05-07T20:32:29.4856845Z x1 = x[:, D:] 2025-05-07T20:32:29.4856917Z 2025-05-07T20:32:29.4857001Z if contiguous: 2025-05-07T20:32:29.4857095Z x0 = x0.contiguous() 2025-05-07T20:32:29.4857186Z x1 = x1.contiguous() 2025-05-07T20:32:29.4857260Z 2025-05-07T20:32:29.4857354Z if scale_ub is not None: 2025-05-07T20:32:29.4857461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4857604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4857680Z ) 2025-05-07T20:32:29.4857760Z else: 2025-05-07T20:32:29.4857857Z scale_ub_tensor = None 2025-05-07T20:32:29.4857935Z 2025-05-07T20:32:29.4858067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4858162Z op = silu_mul_quant 2025-05-07T20:32:29.4858246Z if compiled: 2025-05-07T20:32:29.4858345Z op = torch.compile(op) 2025-05-07T20:32:29.4858456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4858529Z 2025-05-07T20:32:29.4858621Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4858628Z 2025-05-07T20:32:29.4858726Z moe/activation_test.py:117: 2025-05-07T20:32:29.4858855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4858961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4859061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4859440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4859540Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4860135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4860236Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4860610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4860840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4861195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4861375Z kernel = self.compile( 2025-05-07T20:32:29.4861772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4861956Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4862085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4862126Z 2025-05-07T20:32:29.4862343Z self = 2025-05-07T20:32:29.4863145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4863670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1d580>} 2025-05-07T20:32:29.4864498Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4864694Z context = 2025-05-07T20:32:29.4864701Z 2025-05-07T20:32:29.4864877Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4865152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4865261Z module_map=module_map) 2025-05-07T20:32:29.4865429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4865529Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4865610Z E ^ 2025-05-07T20:32:29.4865978Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4865985Z 2025-05-07T20:32:29.4866420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4866424Z 2025-05-07T20:32:29.4866531Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4866764Z self=, 2025-05-07T20:32:29.4866850Z T=4096, 2025-05-07T20:32:29.4866931Z D=5120, 2025-05-07T20:32:29.4867015Z scale_ub=1200.0, 2025-05-07T20:32:29.4867104Z contiguous=True, 2025-05-07T20:32:29.4867188Z compiled=True, 2025-05-07T20:32:29.4867260Z ) 2025-05-07T20:32:29.4867486Z self = 2025-05-07T20:32:29.4867664Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4867669Z 2025-05-07T20:32:29.4867749Z @given( 2025-05-07T20:32:29.4867871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4867971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4868092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4868209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4868324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4868401Z ) 2025-05-07T20:32:29.4868656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4868829Z def test_silu_mul_quant( 2025-05-07T20:32:29.4868910Z self, 2025-05-07T20:32:29.4868987Z T: int, 2025-05-07T20:32:29.4869065Z D: int, 2025-05-07T20:32:29.4869164Z scale_ub: Optional[float], 2025-05-07T20:32:29.4869253Z contiguous: bool, 2025-05-07T20:32:29.4869340Z compiled: bool, 2025-05-07T20:32:29.4869418Z ) -> None: 2025-05-07T20:32:29.4869513Z torch.manual_seed(2025) 2025-05-07T20:32:29.4869587Z 2025-05-07T20:32:29.4869763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4869878Z 2025-05-07T20:32:29.4869973Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4870099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4870188Z x = x_sign * x_clamp 2025-05-07T20:32:29.4870271Z x0 = x[:, :D] 2025-05-07T20:32:29.4870352Z x1 = x[:, D:] 2025-05-07T20:32:29.4870462Z 2025-05-07T20:32:29.4870551Z if contiguous: 2025-05-07T20:32:29.4870647Z x0 = x0.contiguous() 2025-05-07T20:32:29.4870742Z x1 = x1.contiguous() 2025-05-07T20:32:29.4870814Z 2025-05-07T20:32:29.4870906Z if scale_ub is not None: 2025-05-07T20:32:29.4871014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4871152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4871228Z ) 2025-05-07T20:32:29.4871307Z else: 2025-05-07T20:32:29.4871401Z scale_ub_tensor = None 2025-05-07T20:32:29.4871473Z 2025-05-07T20:32:29.4871611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4871701Z op = silu_mul_quant 2025-05-07T20:32:29.4871786Z if compiled: 2025-05-07T20:32:29.4871889Z op = torch.compile(op) 2025-05-07T20:32:29.4871996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4872072Z 2025-05-07T20:32:29.4872162Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4872167Z 2025-05-07T20:32:29.4872270Z moe/activation_test.py:117: 2025-05-07T20:32:29.4872402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4872503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4872604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4872992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4873088Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4873608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4873712Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4874138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4874368Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4874729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4874826Z kernel = self.compile( 2025-05-07T20:32:29.4875222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4875404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4875534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4875538Z 2025-05-07T20:32:29.4875748Z self = 2025-05-07T20:32:29.4876557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4877223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1e840>} 2025-05-07T20:32:29.4877998Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4878194Z context = 2025-05-07T20:32:29.4878199Z 2025-05-07T20:32:29.4878370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4878681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4878790Z module_map=module_map) 2025-05-07T20:32:29.4878956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4879055Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4879169Z E ^ 2025-05-07T20:32:29.4879546Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4879550Z 2025-05-07T20:32:29.4879978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4879983Z 2025-05-07T20:32:29.4880089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4880369Z self=, 2025-05-07T20:32:29.4880446Z T=128, 2025-05-07T20:32:29.4880526Z D=5120, 2025-05-07T20:32:29.4880611Z scale_ub=1200.0, 2025-05-07T20:32:29.4880698Z contiguous=False, 2025-05-07T20:32:29.4880783Z compiled=True, 2025-05-07T20:32:29.4880856Z ) 2025-05-07T20:32:29.4881078Z self = 2025-05-07T20:32:29.4881258Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4881266Z 2025-05-07T20:32:29.4881344Z @given( 2025-05-07T20:32:29.4881474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4881574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4881690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4881811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4881926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4882000Z ) 2025-05-07T20:32:29.4882262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4882355Z def test_silu_mul_quant( 2025-05-07T20:32:29.4882433Z self, 2025-05-07T20:32:29.4882512Z T: int, 2025-05-07T20:32:29.4882588Z D: int, 2025-05-07T20:32:29.4882690Z scale_ub: Optional[float], 2025-05-07T20:32:29.4882779Z contiguous: bool, 2025-05-07T20:32:29.4882864Z compiled: bool, 2025-05-07T20:32:29.4882949Z ) -> None: 2025-05-07T20:32:29.4883045Z torch.manual_seed(2025) 2025-05-07T20:32:29.4883122Z 2025-05-07T20:32:29.4883301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4883378Z 2025-05-07T20:32:29.4883470Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4883598Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4883687Z x = x_sign * x_clamp 2025-05-07T20:32:29.4883767Z x0 = x[:, :D] 2025-05-07T20:32:29.4883851Z x1 = x[:, D:] 2025-05-07T20:32:29.4883923Z 2025-05-07T20:32:29.4884007Z if contiguous: 2025-05-07T20:32:29.4884100Z x0 = x0.contiguous() 2025-05-07T20:32:29.4884192Z x1 = x1.contiguous() 2025-05-07T20:32:29.4884270Z 2025-05-07T20:32:29.4884363Z if scale_ub is not None: 2025-05-07T20:32:29.4884469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4884610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4884687Z ) 2025-05-07T20:32:29.4884764Z else: 2025-05-07T20:32:29.4884947Z scale_ub_tensor = None 2025-05-07T20:32:29.4885020Z 2025-05-07T20:32:29.4885154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4885248Z op = silu_mul_quant 2025-05-07T20:32:29.4885335Z if compiled: 2025-05-07T20:32:29.4885435Z op = torch.compile(op) 2025-05-07T20:32:29.4885549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4885619Z 2025-05-07T20:32:29.4885712Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4885716Z 2025-05-07T20:32:29.4885814Z moe/activation_test.py:117: 2025-05-07T20:32:29.4885983Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4886086Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4886186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4886565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4886709Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4887221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4887322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4887691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4887925Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4888281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4888378Z kernel = self.compile( 2025-05-07T20:32:29.4888772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4888954Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4889089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4889093Z 2025-05-07T20:32:29.4889304Z self = 2025-05-07T20:32:29.4890104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4890628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1f4c0>} 2025-05-07T20:32:29.4891401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4891596Z context = 2025-05-07T20:32:29.4891606Z 2025-05-07T20:32:29.4891785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4892056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4892167Z module_map=module_map) 2025-05-07T20:32:29.4892331Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4892430Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4892510Z E ^ 2025-05-07T20:32:29.4892878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4892885Z 2025-05-07T20:32:29.4893313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4893321Z 2025-05-07T20:32:29.4893427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4893657Z self=, 2025-05-07T20:32:29.4893814Z T=16384, 2025-05-07T20:32:29.4893893Z D=7168, 2025-05-07T20:32:29.4893977Z scale_ub=1200.0, 2025-05-07T20:32:29.4894065Z contiguous=True, 2025-05-07T20:32:29.4894147Z compiled=True, 2025-05-07T20:32:29.4894219Z ) 2025-05-07T20:32:29.4894446Z self = 2025-05-07T20:32:29.4894627Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4894631Z 2025-05-07T20:32:29.4894715Z @given( 2025-05-07T20:32:29.4894839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4894980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4895100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4895218Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4895332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4895448Z ) 2025-05-07T20:32:29.4895707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4895801Z def test_silu_mul_quant( 2025-05-07T20:32:29.4895881Z self, 2025-05-07T20:32:29.4895959Z T: int, 2025-05-07T20:32:29.4896034Z D: int, 2025-05-07T20:32:29.4896136Z scale_ub: Optional[float], 2025-05-07T20:32:29.4896226Z contiguous: bool, 2025-05-07T20:32:29.4896318Z compiled: bool, 2025-05-07T20:32:29.4896397Z ) -> None: 2025-05-07T20:32:29.4896492Z torch.manual_seed(2025) 2025-05-07T20:32:29.4896568Z 2025-05-07T20:32:29.4896746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4896819Z 2025-05-07T20:32:29.4896913Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4897040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4897128Z x = x_sign * x_clamp 2025-05-07T20:32:29.4897215Z x0 = x[:, :D] 2025-05-07T20:32:29.4897294Z x1 = x[:, D:] 2025-05-07T20:32:29.4897370Z 2025-05-07T20:32:29.4897457Z if contiguous: 2025-05-07T20:32:29.4897549Z x0 = x0.contiguous() 2025-05-07T20:32:29.4897638Z x1 = x1.contiguous() 2025-05-07T20:32:29.4897713Z 2025-05-07T20:32:29.4897804Z if scale_ub is not None: 2025-05-07T20:32:29.4897914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4898052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4898131Z ) 2025-05-07T20:32:29.4898210Z else: 2025-05-07T20:32:29.4898304Z scale_ub_tensor = None 2025-05-07T20:32:29.4898378Z 2025-05-07T20:32:29.4898515Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4898605Z op = silu_mul_quant 2025-05-07T20:32:29.4898690Z if compiled: 2025-05-07T20:32:29.4898794Z op = torch.compile(op) 2025-05-07T20:32:29.4898903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4898979Z 2025-05-07T20:32:29.4899079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4899083Z 2025-05-07T20:32:29.4899181Z moe/activation_test.py:117: 2025-05-07T20:32:29.4899316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4899418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4899519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4899950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4900045Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4900560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4900663Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4901033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4901351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4901707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4901805Z kernel = self.compile( 2025-05-07T20:32:29.4902204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4902386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4902518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4902560Z 2025-05-07T20:32:29.4902774Z self = 2025-05-07T20:32:29.4903577Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4904172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e34c20>} 2025-05-07T20:32:29.4904948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4905172Z context = 2025-05-07T20:32:29.4905177Z 2025-05-07T20:32:29.4905371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4905646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4905757Z module_map=module_map) 2025-05-07T20:32:29.4905922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4906030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4906112Z E ^ 2025-05-07T20:32:29.4906478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4906482Z 2025-05-07T20:32:29.4906916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4906920Z 2025-05-07T20:32:29.4907026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4907261Z self=, 2025-05-07T20:32:29.4907340Z T=16384, 2025-05-07T20:32:29.4907420Z D=5120, 2025-05-07T20:32:29.4907507Z scale_ub=1200.0, 2025-05-07T20:32:29.4907591Z contiguous=True, 2025-05-07T20:32:29.4907676Z compiled=False, 2025-05-07T20:32:29.4907751Z ) 2025-05-07T20:32:29.4907975Z self = 2025-05-07T20:32:29.4908161Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.4908169Z 2025-05-07T20:32:29.4908249Z @given( 2025-05-07T20:32:29.4908371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4908476Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4908593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4908711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4908830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4908904Z ) 2025-05-07T20:32:29.4909156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4909259Z def test_silu_mul_quant( 2025-05-07T20:32:29.4909336Z self, 2025-05-07T20:32:29.4909411Z T: int, 2025-05-07T20:32:29.4909490Z D: int, 2025-05-07T20:32:29.4909589Z scale_ub: Optional[float], 2025-05-07T20:32:29.4909679Z contiguous: bool, 2025-05-07T20:32:29.4909769Z compiled: bool, 2025-05-07T20:32:29.4909847Z ) -> None: 2025-05-07T20:32:29.4910031Z torch.manual_seed(2025) 2025-05-07T20:32:29.4910105Z 2025-05-07T20:32:29.4910281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4910359Z 2025-05-07T20:32:29.4910452Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4910577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4910669Z x = x_sign * x_clamp 2025-05-07T20:32:29.4910748Z x0 = x[:, :D] 2025-05-07T20:32:29.4910827Z x1 = x[:, D:] 2025-05-07T20:32:29.4910903Z 2025-05-07T20:32:29.4911027Z if contiguous: 2025-05-07T20:32:29.4911120Z x0 = x0.contiguous() 2025-05-07T20:32:29.4911216Z x1 = x1.contiguous() 2025-05-07T20:32:29.4911287Z 2025-05-07T20:32:29.4911377Z if scale_ub is not None: 2025-05-07T20:32:29.4911488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4911665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4911748Z ) 2025-05-07T20:32:29.4911824Z else: 2025-05-07T20:32:29.4911919Z scale_ub_tensor = None 2025-05-07T20:32:29.4911994Z 2025-05-07T20:32:29.4912127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4912219Z op = silu_mul_quant 2025-05-07T20:32:29.4912306Z if compiled: 2025-05-07T20:32:29.4912406Z op = torch.compile(op) 2025-05-07T20:32:29.4912512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4912586Z 2025-05-07T20:32:29.4912676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4912682Z 2025-05-07T20:32:29.4912783Z moe/activation_test.py:117: 2025-05-07T20:32:29.4912913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4913014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4913117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4914067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:29.4914171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4914549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4914776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4915132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4915227Z kernel = self.compile( 2025-05-07T20:32:29.4915622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4915809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4915940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4915947Z 2025-05-07T20:32:29.4916160Z self = 2025-05-07T20:32:29.4916961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4917479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e35580>} 2025-05-07T20:32:29.4918246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4918441Z context = 2025-05-07T20:32:29.4918446Z 2025-05-07T20:32:29.4918617Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4919029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4919139Z module_map=module_map) 2025-05-07T20:32:29.4919303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4919401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4919477Z E ^ 2025-05-07T20:32:29.4919844Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4919848Z 2025-05-07T20:32:29.4920340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4920401Z 2025-05-07T20:32:29.4920508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4920734Z self=, 2025-05-07T20:32:29.4920866Z T=1, 2025-05-07T20:32:29.4920943Z D=7168, 2025-05-07T20:32:29.4921033Z scale_ub=1200.0, 2025-05-07T20:32:29.4923777Z contiguous=False, 2025-05-07T20:32:29.4923886Z compiled=False, 2025-05-07T20:32:29.4923961Z ) 2025-05-07T20:32:29.4924190Z self = 2025-05-07T20:32:29.4924369Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4924374Z 2025-05-07T20:32:29.4924453Z @given( 2025-05-07T20:32:29.4924574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4924678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4924797Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4924922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4925038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4925114Z ) 2025-05-07T20:32:29.4925371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4925469Z def test_silu_mul_quant( 2025-05-07T20:32:29.4925549Z self, 2025-05-07T20:32:29.4925650Z T: int, 2025-05-07T20:32:29.4925728Z D: int, 2025-05-07T20:32:29.4925827Z scale_ub: Optional[float], 2025-05-07T20:32:29.4925919Z contiguous: bool, 2025-05-07T20:32:29.4926006Z compiled: bool, 2025-05-07T20:32:29.4926085Z ) -> None: 2025-05-07T20:32:29.4926184Z torch.manual_seed(2025) 2025-05-07T20:32:29.4926256Z 2025-05-07T20:32:29.4926428Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4926506Z 2025-05-07T20:32:29.4926603Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4926729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4926821Z x = x_sign * x_clamp 2025-05-07T20:32:29.4926904Z x0 = x[:, :D] 2025-05-07T20:32:29.4926987Z x1 = x[:, D:] 2025-05-07T20:32:29.4927058Z 2025-05-07T20:32:29.4927146Z if contiguous: 2025-05-07T20:32:29.4927240Z x0 = x0.contiguous() 2025-05-07T20:32:29.4927333Z x1 = x1.contiguous() 2025-05-07T20:32:29.4927404Z 2025-05-07T20:32:29.4927501Z if scale_ub is not None: 2025-05-07T20:32:29.4927610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4927747Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4927827Z ) 2025-05-07T20:32:29.4927902Z else: 2025-05-07T20:32:29.4927997Z scale_ub_tensor = None 2025-05-07T20:32:29.4928074Z 2025-05-07T20:32:29.4928207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4928305Z op = silu_mul_quant 2025-05-07T20:32:29.4928392Z if compiled: 2025-05-07T20:32:29.4928491Z op = torch.compile(op) 2025-05-07T20:32:29.4928602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4928674Z 2025-05-07T20:32:29.4928764Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4928770Z 2025-05-07T20:32:29.4928931Z moe/activation_test.py:117: 2025-05-07T20:32:29.4929066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4929168Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4929272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4929786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4929889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4930259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4930530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4930886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4930981Z kernel = self.compile( 2025-05-07T20:32:29.4931420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4931684Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4931818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4931822Z 2025-05-07T20:32:29.4932036Z self = 2025-05-07T20:32:29.4932840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4933367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e368e0>} 2025-05-07T20:32:29.4934143Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4934342Z context = 2025-05-07T20:32:29.4934347Z 2025-05-07T20:32:29.4934517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4934789Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4934900Z module_map=module_map) 2025-05-07T20:32:29.4935065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4935168Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4935253Z E ^ 2025-05-07T20:32:29.4935619Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4935623Z 2025-05-07T20:32:29.4936056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4936067Z 2025-05-07T20:32:29.4936177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4936408Z self=, 2025-05-07T20:32:29.4936489Z T=4096, 2025-05-07T20:32:29.4936568Z D=7168, 2025-05-07T20:32:29.4936652Z scale_ub=1200.0, 2025-05-07T20:32:29.4936741Z contiguous=False, 2025-05-07T20:32:29.4936825Z compiled=True, 2025-05-07T20:32:29.4936900Z ) 2025-05-07T20:32:29.4937129Z self = 2025-05-07T20:32:29.4937312Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4937317Z 2025-05-07T20:32:29.4937398Z @given( 2025-05-07T20:32:29.4937519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4937618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4937739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4937903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4938022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4938101Z ) 2025-05-07T20:32:29.4938353Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4938448Z def test_silu_mul_quant( 2025-05-07T20:32:29.4938528Z self, 2025-05-07T20:32:29.4938605Z T: int, 2025-05-07T20:32:29.4938682Z D: int, 2025-05-07T20:32:29.4938784Z scale_ub: Optional[float], 2025-05-07T20:32:29.4938874Z contiguous: bool, 2025-05-07T20:32:29.4939029Z compiled: bool, 2025-05-07T20:32:29.4939108Z ) -> None: 2025-05-07T20:32:29.4939204Z torch.manual_seed(2025) 2025-05-07T20:32:29.4942624Z 2025-05-07T20:32:29.4942820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4942900Z 2025-05-07T20:32:29.4943070Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4943208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4943354Z x = x_sign * x_clamp 2025-05-07T20:32:29.4943441Z x0 = x[:, :D] 2025-05-07T20:32:29.4943527Z x1 = x[:, D:] 2025-05-07T20:32:29.4943601Z 2025-05-07T20:32:29.4943686Z if contiguous: 2025-05-07T20:32:29.4943780Z x0 = x0.contiguous() 2025-05-07T20:32:29.4943869Z x1 = x1.contiguous() 2025-05-07T20:32:29.4943941Z 2025-05-07T20:32:29.4944040Z if scale_ub is not None: 2025-05-07T20:32:29.4944149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4944292Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4944371Z ) 2025-05-07T20:32:29.4944447Z else: 2025-05-07T20:32:29.4944541Z scale_ub_tensor = None 2025-05-07T20:32:29.4944616Z 2025-05-07T20:32:29.4944751Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4944849Z op = silu_mul_quant 2025-05-07T20:32:29.4944937Z if compiled: 2025-05-07T20:32:29.4945040Z op = torch.compile(op) 2025-05-07T20:32:29.4945151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4945224Z 2025-05-07T20:32:29.4945319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4945324Z 2025-05-07T20:32:29.4945429Z moe/activation_test.py:117: 2025-05-07T20:32:29.4945562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4945664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4945768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4946156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4946255Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4946764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4946869Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4947246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4947474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4947830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4947925Z kernel = self.compile( 2025-05-07T20:32:29.4948318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4948505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4948634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4948639Z 2025-05-07T20:32:29.4948848Z self = 2025-05-07T20:32:29.4949705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4950228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e37a60>} 2025-05-07T20:32:29.4951002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4951238Z context = 2025-05-07T20:32:29.4951243Z 2025-05-07T20:32:29.4951415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4951689Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4951838Z module_map=module_map) 2025-05-07T20:32:29.4952050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4952156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4952233Z E ^ 2025-05-07T20:32:29.4952603Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4952607Z 2025-05-07T20:32:29.4953033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4953039Z 2025-05-07T20:32:29.4953146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4953375Z self=, 2025-05-07T20:32:29.4953453Z T=128, 2025-05-07T20:32:29.4953534Z D=7168, 2025-05-07T20:32:29.4953619Z scale_ub=1200.0, 2025-05-07T20:32:29.4953711Z contiguous=False, 2025-05-07T20:32:29.4953806Z compiled=True, 2025-05-07T20:32:29.4953899Z ) 2025-05-07T20:32:29.4954150Z self = 2025-05-07T20:32:29.4954330Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4954335Z 2025-05-07T20:32:29.4954411Z @given( 2025-05-07T20:32:29.4954536Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4954638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4954755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4954879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4954997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4955072Z ) 2025-05-07T20:32:29.4955331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4955425Z def test_silu_mul_quant( 2025-05-07T20:32:29.4955507Z self, 2025-05-07T20:32:29.4955591Z T: int, 2025-05-07T20:32:29.4955666Z D: int, 2025-05-07T20:32:29.4955777Z scale_ub: Optional[float], 2025-05-07T20:32:29.4955867Z contiguous: bool, 2025-05-07T20:32:29.4955953Z compiled: bool, 2025-05-07T20:32:29.4956035Z ) -> None: 2025-05-07T20:32:29.4956129Z torch.manual_seed(2025) 2025-05-07T20:32:29.4956201Z 2025-05-07T20:32:29.4956380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4956459Z 2025-05-07T20:32:29.4956552Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4956681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4956773Z x = x_sign * x_clamp 2025-05-07T20:32:29.4956856Z x0 = x[:, :D] 2025-05-07T20:32:29.4956938Z x1 = x[:, D:] 2025-05-07T20:32:29.4957011Z 2025-05-07T20:32:29.4957098Z if contiguous: 2025-05-07T20:32:29.4957190Z x0 = x0.contiguous() 2025-05-07T20:32:29.4957280Z x1 = x1.contiguous() 2025-05-07T20:32:29.4957359Z 2025-05-07T20:32:29.4957498Z if scale_ub is not None: 2025-05-07T20:32:29.4957610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4957755Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4957830Z ) 2025-05-07T20:32:29.4957905Z else: 2025-05-07T20:32:29.4958005Z scale_ub_tensor = None 2025-05-07T20:32:29.4958077Z 2025-05-07T20:32:29.4958209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4958303Z op = silu_mul_quant 2025-05-07T20:32:29.4958392Z if compiled: 2025-05-07T20:32:29.4958537Z op = torch.compile(op) 2025-05-07T20:32:29.4958643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4958715Z 2025-05-07T20:32:29.4958810Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4958815Z 2025-05-07T20:32:29.4958913Z moe/activation_test.py:117: 2025-05-07T20:32:29.4959082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4959189Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4959333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4959714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4959813Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4960424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4960532Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4960901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4961134Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4961490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4961592Z kernel = self.compile( 2025-05-07T20:32:29.4961995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4962174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4962303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4962308Z 2025-05-07T20:32:29.4962521Z self = 2025-05-07T20:32:29.4963325Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4963879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f817cea0>} 2025-05-07T20:32:29.4964680Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4964877Z context = 2025-05-07T20:32:29.4964881Z 2025-05-07T20:32:29.4965053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4965325Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4965437Z module_map=module_map) 2025-05-07T20:32:29.4965605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4965705Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4965785Z E ^ 2025-05-07T20:32:29.4966150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
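Note on the repeated CompilationError above: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on GPUs with compute capability 8.9 or newer; on older architectures it exposes only the fp8e4b15 and fp8e5 variants named in the ValueError, so the kernel fails to compile regardless of input shape. A minimal guard sketch using only public torch APIs (fp8e4nv_supported is a hypothetical helper name, not an FBGEMM or test-suite function):

    import torch

    def fp8e4nv_supported() -> bool:
        # Best-effort check: Triton accepts fp8e4nv (E4M3) only on SM 8.9+
        # (Ada/Hopper-class) devices; earlier GPUs raise the ValueError above.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

Tests that exercise the fp8 path could then skip themselves on unsupported devices instead of failing inside the compiler.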
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[test source identical to the first full example above]
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[test source identical to the first full example above]
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[test source identical to the first full example above]
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[test source identical to the first full example above]
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[test source identical to the first full example above]
> x_sign = torch.sign(x)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:94: OutOfMemoryError
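The allocation sizes in these OutOfMemoryError messages match the test's [T, 2 * D] bfloat16 input exactly (for example, 16384 * 2 * 7168 * 2 bytes = 448.00 MiB), and the amount of free memory shrinks from one Hypothesis example to the next, which suggests tensors or cached allocator blocks from earlier examples are still resident. A sketch of the two conventional mitigations, under the assumption that they are applied inside the test process (release_cuda_memory is a hypothetical helper, not part of activation_test.py):

    import gc
    import os

    # The setting the allocator message itself suggests; it must be set before
    # the first CUDA allocation in the process to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached blocks to the
        # driver so the next Hypothesis example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() between examples (for instance, from the test's setUp) trades some re-allocation cost for headroom on a 22 GiB device.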
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
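This example ran with compiled=False and still failed in Triton's make_ir: _fbgemm_silu_mul_quant is a Triton JIT kernel, so it is compiled at its first launch whether or not the caller is wrapped in torch.compile, and the failure is a property of the GPU rather than of Dynamo. A sketch of skipping at the test level, reusing the hypothetical fp8e4nv_supported() helper from the earlier note (the class name here is illustrative, not the real test class):

    import unittest

    @unittest.skipIf(not fp8e4nv_supported(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):
        # fp8-dependent tests such as test_silu_mul_quant would live here.
        ...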
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[test source identical to the first full example above]
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
[test source identical to the first full example above]
> x_sign = torch.sign(x)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:94: OutOfMemoryError
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5073928Z 2025-05-07T20:32:29.5074049Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:29.5074054Z 2025-05-07T20:32:29.5074161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5074395Z self=, 2025-05-07T20:32:29.5074477Z T=16384, 2025-05-07T20:32:29.5074557Z D=5120, 2025-05-07T20:32:29.5074640Z scale_ub=None, 2025-05-07T20:32:29.5074726Z contiguous=True, 2025-05-07T20:32:29.5074814Z compiled=False, 2025-05-07T20:32:29.5074888Z ) 2025-05-07T20:32:29.5075115Z self = 2025-05-07T20:32:29.5075303Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:29.5075307Z 2025-05-07T20:32:29.5075387Z @given( 2025-05-07T20:32:29.5075507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5075609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5075724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5075844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5075959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5076033Z ) 2025-05-07T20:32:29.5076294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5076388Z def test_silu_mul_quant( 2025-05-07T20:32:29.5076463Z self, 2025-05-07T20:32:29.5076542Z T: int, 2025-05-07T20:32:29.5076617Z D: int, 2025-05-07T20:32:29.5076715Z scale_ub: Optional[float], 2025-05-07T20:32:29.5076810Z contiguous: bool, 2025-05-07T20:32:29.5076943Z compiled: bool, 2025-05-07T20:32:29.5077024Z ) -> None: 2025-05-07T20:32:29.5077122Z torch.manual_seed(2025) 2025-05-07T20:32:29.5077195Z 2025-05-07T20:32:29.5077367Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5079207Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5079251Z 2025-05-07T20:32:29.5079409Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5079413Z 2025-05-07T20:32:29.5079555Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5079787Z self=, 2025-05-07T20:32:29.5079867Z T=4096, 2025-05-07T20:32:29.5079945Z D=5120, 2025-05-07T20:32:29.5080028Z scale_ub=None, 2025-05-07T20:32:29.5080196Z contiguous=True, 2025-05-07T20:32:29.5080282Z compiled=False, 2025-05-07T20:32:29.5080356Z ) 2025-05-07T20:32:29.5080582Z self = 2025-05-07T20:32:29.5080760Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:29.5080766Z 2025-05-07T20:32:29.5080846Z @given( 2025-05-07T20:32:29.5080964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5081064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5081183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5081306Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5081427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5081503Z ) 2025-05-07T20:32:29.5081756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5081850Z def test_silu_mul_quant( 2025-05-07T20:32:29.5081931Z self, 2025-05-07T20:32:29.5082006Z T: int, 2025-05-07T20:32:29.5082084Z D: int, 2025-05-07T20:32:29.5082182Z scale_ub: Optional[float], 2025-05-07T20:32:29.5082273Z contiguous: bool, 2025-05-07T20:32:29.5082362Z compiled: bool, 2025-05-07T20:32:29.5082441Z ) -> None: 2025-05-07T20:32:29.5082536Z torch.manual_seed(2025) 2025-05-07T20:32:29.5082610Z 2025-05-07T20:32:29.5082782Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5084605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5084617Z 2025-05-07T20:32:29.5084736Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5084741Z 2025-05-07T20:32:29.5084844Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5085079Z self=, 2025-05-07T20:32:29.5085156Z T=2048, 2025-05-07T20:32:29.5085238Z D=5120, 2025-05-07T20:32:29.5085321Z scale_ub=None, 2025-05-07T20:32:29.5085408Z contiguous=False, 2025-05-07T20:32:29.5085498Z compiled=False, 2025-05-07T20:32:29.5085573Z ) 2025-05-07T20:32:29.5085844Z self = 2025-05-07T20:32:29.5086029Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.5086034Z 2025-05-07T20:32:29.5086111Z @given( 2025-05-07T20:32:29.5086230Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5086332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5086447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5086568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5086682Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5086796Z ) 2025-05-07T20:32:29.5087052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5087146Z def test_silu_mul_quant( 2025-05-07T20:32:29.5087221Z self, 2025-05-07T20:32:29.5087303Z T: int, 2025-05-07T20:32:29.5087417Z D: int, 2025-05-07T20:32:29.5087517Z scale_ub: Optional[float], 2025-05-07T20:32:29.5087645Z contiguous: bool, 2025-05-07T20:32:29.5087732Z compiled: bool, 2025-05-07T20:32:29.5087809Z ) -> None: 2025-05-07T20:32:29.5087909Z torch.manual_seed(2025) 2025-05-07T20:32:29.5087981Z 2025-05-07T20:32:29.5088152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5089976Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5089986Z 2025-05-07T20:32:29.5090110Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5090116Z 2025-05-07T20:32:29.5090222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5090451Z self=, 2025-05-07T20:32:29.5090530Z T=4096, 2025-05-07T20:32:29.5090607Z D=7168, 2025-05-07T20:32:29.5090689Z scale_ub=None, 2025-05-07T20:32:29.5090776Z contiguous=True, 2025-05-07T20:32:29.5090859Z compiled=True, 2025-05-07T20:32:29.5090931Z ) 2025-05-07T20:32:29.5091156Z self = 2025-05-07T20:32:29.5091332Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.5091336Z 2025-05-07T20:32:29.5091416Z @given( 2025-05-07T20:32:29.5091535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5091635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5091758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5091882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5091996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5092072Z ) 2025-05-07T20:32:29.5092324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5092418Z def test_silu_mul_quant( 2025-05-07T20:32:29.5092496Z self, 2025-05-07T20:32:29.5092572Z T: int, 2025-05-07T20:32:29.5092650Z D: int, 2025-05-07T20:32:29.5092748Z scale_ub: Optional[float], 2025-05-07T20:32:29.5092843Z contiguous: bool, 2025-05-07T20:32:29.5092931Z compiled: bool, 2025-05-07T20:32:29.5093008Z ) -> None: 2025-05-07T20:32:29.5093103Z torch.manual_seed(2025) 2025-05-07T20:32:29.5093178Z 2025-05-07T20:32:29.5093348Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5095589Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5095599Z 2025-05-07T20:32:29.5095721Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5095764Z 2025-05-07T20:32:29.5095870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5096105Z self=, 2025-05-07T20:32:29.5096182Z T=2048, 2025-05-07T20:32:29.5096262Z D=5120, 2025-05-07T20:32:29.5096344Z scale_ub=1200.0, 2025-05-07T20:32:29.5096469Z contiguous=False, 2025-05-07T20:32:29.5096559Z compiled=False, 2025-05-07T20:32:29.5096632Z ) 2025-05-07T20:32:29.5096892Z self = 2025-05-07T20:32:29.5097077Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.5097081Z 2025-05-07T20:32:29.5097158Z @given( 2025-05-07T20:32:29.5097279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5097381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5097495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5097614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5097731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5097805Z ) 2025-05-07T20:32:29.5098060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5098154Z def test_silu_mul_quant( 2025-05-07T20:32:29.5098232Z self, 2025-05-07T20:32:29.5098313Z T: int, 2025-05-07T20:32:29.5098394Z D: int, 2025-05-07T20:32:29.5098495Z scale_ub: Optional[float], 2025-05-07T20:32:29.5098587Z contiguous: bool, 2025-05-07T20:32:29.5098673Z compiled: bool, 2025-05-07T20:32:29.5098750Z ) -> None: 2025-05-07T20:32:29.5098853Z torch.manual_seed(2025) 2025-05-07T20:32:29.5098925Z 2025-05-07T20:32:29.5099097Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5100924Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The same OutOfMemoryError is then raised at moe/activation_test.py:92 (the initial torch.randn([T, 2 * D], ...) allocation) for each of the following examples, with only the requested size changing with T and D:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> Tried to allocate 448.00 MiB

In every case GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free; the process has 22.04 GiB in use, 21.73 GiB of it allocated by PyTorch and 19.12 MiB reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
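A minimal sketch of applying that allocator hint, assuming the suite is re-launched in the same build_binary environment; PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it has to be set before the first CUDA allocation:

    # Sketch: apply the allocator hint from the OOM messages above.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after setting the env var so the allocator sees it

    # The kind of allocation that failed above, now served from expandable segments.
    x = torch.randn([2048, 2 * 5120], device="cuda", dtype=torch.bfloat16)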
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5127992Z 2025-05-07T20:32:29.5128111Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5128115Z 2025-05-07T20:32:29.5128224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5128499Z self=, 2025-05-07T20:32:29.5128580Z T=128, 2025-05-07T20:32:29.5128660Z D=5120, 2025-05-07T20:32:29.5128742Z scale_ub=1200.0, 2025-05-07T20:32:29.5128829Z contiguous=False, 2025-05-07T20:32:29.5128914Z compiled=False, 2025-05-07T20:32:29.5128987Z ) 2025-05-07T20:32:29.5129208Z self = 2025-05-07T20:32:29.5129389Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.5129393Z 2025-05-07T20:32:29.5129469Z @given( 2025-05-07T20:32:29.5129633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5129732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5129847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5129969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5130084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5130197Z ) 2025-05-07T20:32:29.5130561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5130657Z def test_silu_mul_quant( 2025-05-07T20:32:29.5130733Z self, 2025-05-07T20:32:29.5130812Z T: int, 2025-05-07T20:32:29.5130887Z D: int, 2025-05-07T20:32:29.5130989Z scale_ub: Optional[float], 2025-05-07T20:32:29.5131079Z contiguous: bool, 2025-05-07T20:32:29.5131165Z compiled: bool, 2025-05-07T20:32:29.5131244Z ) -> None: 2025-05-07T20:32:29.5131339Z torch.manual_seed(2025) 2025-05-07T20:32:29.5131411Z 2025-05-07T20:32:29.5131586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5131660Z 2025-05-07T20:32:29.5131752Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5131882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5131971Z x = x_sign * x_clamp 2025-05-07T20:32:29.5132055Z x0 = x[:, :D] 2025-05-07T20:32:29.5132137Z x1 = x[:, D:] 2025-05-07T20:32:29.5132212Z 2025-05-07T20:32:29.5132298Z if contiguous: 2025-05-07T20:32:29.5132395Z x0 = x0.contiguous() 2025-05-07T20:32:29.5132494Z x1 = x1.contiguous() 2025-05-07T20:32:29.5132567Z 2025-05-07T20:32:29.5132658Z if scale_ub is not None: 2025-05-07T20:32:29.5132767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5132904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5132980Z ) 2025-05-07T20:32:29.5133059Z else: 2025-05-07T20:32:29.5133153Z scale_ub_tensor = None 2025-05-07T20:32:29.5133226Z 2025-05-07T20:32:29.5133360Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5133451Z op = silu_mul_quant 2025-05-07T20:32:29.5133535Z if compiled: 2025-05-07T20:32:29.5133642Z op = torch.compile(op) 2025-05-07T20:32:29.5133752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5133828Z 2025-05-07T20:32:29.5133922Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5133926Z 2025-05-07T20:32:29.5134024Z moe/activation_test.py:117: 2025-05-07T20:32:29.5134157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5134258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5134359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5134881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5134983Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5135356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5135586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5135937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5136084Z kernel = self.compile( 2025-05-07T20:32:29.5136484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5136665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5136796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5136801Z 2025-05-07T20:32:29.5137010Z self = 2025-05-07T20:32:29.5137814Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5138372Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7754ae0>} 2025-05-07T20:32:29.5139220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5139419Z context = 2025-05-07T20:32:29.5139423Z 2025-05-07T20:32:29.5139592Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5139869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5139980Z module_map=module_map) 2025-05-07T20:32:29.5140150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5140250Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5140327Z E ^ 2025-05-07T20:32:29.5140694Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5140704Z 2025-05-07T20:32:29.5141134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5141138Z 2025-05-07T20:32:29.5141244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5141477Z self=, 2025-05-07T20:32:29.5141554Z T=2048, 2025-05-07T20:32:29.5141634Z D=7168, 2025-05-07T20:32:29.5141717Z scale_ub=None, 2025-05-07T20:32:29.5141804Z contiguous=False, 2025-05-07T20:32:29.5141894Z compiled=False, 2025-05-07T20:32:29.5141970Z ) 2025-05-07T20:32:29.5142192Z self = 2025-05-07T20:32:29.5142374Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.5142379Z 2025-05-07T20:32:29.5142455Z @given( 2025-05-07T20:32:29.5142578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5142684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5142802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5142922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5143038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5143113Z ) 2025-05-07T20:32:29.5143370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5143464Z def test_silu_mul_quant( 2025-05-07T20:32:29.5143542Z self, 2025-05-07T20:32:29.5143620Z T: int, 2025-05-07T20:32:29.5143699Z D: int, 2025-05-07T20:32:29.5143798Z scale_ub: Optional[float], 2025-05-07T20:32:29.5143895Z contiguous: bool, 2025-05-07T20:32:29.5143982Z compiled: bool, 2025-05-07T20:32:29.5144060Z ) -> None: 2025-05-07T20:32:29.5144159Z torch.manual_seed(2025) 2025-05-07T20:32:29.5144231Z 2025-05-07T20:32:29.5144455Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5146288Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
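The fp8e4nv CompilationError above, unlike the OOMs, points at a hardware limit: Triton's fp8e4nv corresponds to float8_e4m3fn, which its NVIDIA backend only emits for compute capability 8.9 and newer, while the A10G in this linux.g5.4xlarge runner reports (8, 6). A hedged sketch of a capability guard a test could use to deselect itself on such GPUs; the helper and test names are illustrative, not part of activation_test.py:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) codegen needs SM 8.9+ (Ada/Hopper);
        # the A10G on this runner is SM 8.6, hence the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class GuardedActivationTests(unittest.TestCase):
        @unittest.skipUnless(_supports_fp8e4nv(), "requires an fp8e4nv-capable GPU (SM 8.9+)")
        def test_silu_mul_quant_fp8(self) -> None:
            ...  # FP8 test body would go here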
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5146330Z 2025-05-07T20:32:29.5146454Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5146458Z 2025-05-07T20:32:29.5146562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5146790Z self=, 2025-05-07T20:32:29.5146907Z T=128, 2025-05-07T20:32:29.5146985Z D=7168, 2025-05-07T20:32:29.5147069Z scale_ub=1200.0, 2025-05-07T20:32:29.5147199Z contiguous=True, 2025-05-07T20:32:29.5147285Z compiled=True, 2025-05-07T20:32:29.5147358Z ) 2025-05-07T20:32:29.5147584Z self = 2025-05-07T20:32:29.5147755Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.5147759Z 2025-05-07T20:32:29.5147838Z @given( 2025-05-07T20:32:29.5147958Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5148058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5148180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5148296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5148410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5148485Z ) 2025-05-07T20:32:29.5148737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5148841Z def test_silu_mul_quant( 2025-05-07T20:32:29.5148919Z self, 2025-05-07T20:32:29.5148997Z T: int, 2025-05-07T20:32:29.5149078Z D: int, 2025-05-07T20:32:29.5149177Z scale_ub: Optional[float], 2025-05-07T20:32:29.5149266Z contiguous: bool, 2025-05-07T20:32:29.5149354Z compiled: bool, 2025-05-07T20:32:29.5149431Z ) -> None: 2025-05-07T20:32:29.5149527Z torch.manual_seed(2025) 2025-05-07T20:32:29.5149602Z 2025-05-07T20:32:29.5149773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5149847Z 2025-05-07T20:32:29.5149944Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5150071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5150159Z x = x_sign * x_clamp 2025-05-07T20:32:29.5150243Z x0 = x[:, :D] 2025-05-07T20:32:29.5150322Z x1 = x[:, D:] 2025-05-07T20:32:29.5150396Z 2025-05-07T20:32:29.5150483Z if contiguous: 2025-05-07T20:32:29.5150577Z x0 = x0.contiguous() 2025-05-07T20:32:29.5150672Z x1 = x1.contiguous() 2025-05-07T20:32:29.5150746Z 2025-05-07T20:32:29.5150837Z if scale_ub is not None: 2025-05-07T20:32:29.5150948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5151085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5151160Z ) 2025-05-07T20:32:29.5151239Z else: 2025-05-07T20:32:29.5151333Z scale_ub_tensor = None 2025-05-07T20:32:29.5151404Z 2025-05-07T20:32:29.5151539Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5151633Z op = silu_mul_quant 2025-05-07T20:32:29.5151720Z if compiled: 2025-05-07T20:32:29.5151821Z op = torch.compile(op) 2025-05-07T20:32:29.5151928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5152003Z 2025-05-07T20:32:29.5152100Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5152104Z 2025-05-07T20:32:29.5152251Z moe/activation_test.py:117: 2025-05-07T20:32:29.5152388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5152489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5152591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5152973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.5153066Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.5153577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5153718Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5154088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5154323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5154750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5154846Z kernel = self.compile( 2025-05-07T20:32:29.5155244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5155423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5155556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5155560Z 2025-05-07T20:32:29.5155770Z self = 2025-05-07T20:32:29.5156573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5157100Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7610040>} 2025-05-07T20:32:29.5157873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5158071Z context = 2025-05-07T20:32:29.5158075Z 2025-05-07T20:32:29.5158244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5158522Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5158634Z module_map=module_map) 2025-05-07T20:32:29.5158798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5158902Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5158980Z E ^ 2025-05-07T20:32:29.5159351Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5159356Z 2025-05-07T20:32:29.5159787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5159792Z 2025-05-07T20:32:29.5159897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5160224Z self=, 2025-05-07T20:32:29.5160303Z T=128, 2025-05-07T20:32:29.5160380Z D=7168, 2025-05-07T20:32:29.5160467Z scale_ub=1200.0, 2025-05-07T20:32:29.5160555Z contiguous=True, 2025-05-07T20:32:29.5160639Z compiled=False, 2025-05-07T20:32:29.5160713Z ) 2025-05-07T20:32:29.5160936Z self = 2025-05-07T20:32:29.5161110Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.5161120Z 2025-05-07T20:32:29.5161197Z @given( 2025-05-07T20:32:29.5161362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5161470Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5161586Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5161705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5161823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5161899Z ) 2025-05-07T20:32:29.5162153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5162251Z def test_silu_mul_quant( 2025-05-07T20:32:29.5162366Z self, 2025-05-07T20:32:29.5162442Z T: int, 2025-05-07T20:32:29.5162520Z D: int, 2025-05-07T20:32:29.5162618Z scale_ub: Optional[float], 2025-05-07T20:32:29.5162710Z contiguous: bool, 2025-05-07T20:32:29.5162796Z compiled: bool, 2025-05-07T20:32:29.5162875Z ) -> None: 2025-05-07T20:32:29.5163039Z torch.manual_seed(2025) 2025-05-07T20:32:29.5163113Z 2025-05-07T20:32:29.5163327Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5163406Z 2025-05-07T20:32:29.5163499Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5163625Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5165482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5165490Z 2025-05-07T20:32:29.5165610Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:29.5165617Z 2025-05-07T20:32:29.5165730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5165963Z self=, 2025-05-07T20:32:29.5166044Z T=128, 2025-05-07T20:32:29.5166121Z D=5120, 2025-05-07T20:32:29.5166204Z scale_ub=1200.0, 2025-05-07T20:32:29.5166293Z contiguous=True, 2025-05-07T20:32:29.5166375Z compiled=True, 2025-05-07T20:32:29.5166447Z ) 2025-05-07T20:32:29.5166675Z self = 2025-05-07T20:32:29.5166849Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.5166856Z 2025-05-07T20:32:29.5166932Z @given( 2025-05-07T20:32:29.5167054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5167153Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5167273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5167393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5167509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5167589Z ) 2025-05-07T20:32:29.5167843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5167935Z def test_silu_mul_quant( 2025-05-07T20:32:29.5168014Z self, 2025-05-07T20:32:29.5168091Z T: int, 2025-05-07T20:32:29.5168166Z D: int, 2025-05-07T20:32:29.5168267Z scale_ub: Optional[float], 2025-05-07T20:32:29.5168358Z contiguous: bool, 2025-05-07T20:32:29.5168443Z compiled: bool, 2025-05-07T20:32:29.5168526Z ) -> None: 2025-05-07T20:32:29.5168620Z torch.manual_seed(2025) 2025-05-07T20:32:29.5168695Z 2025-05-07T20:32:29.5168865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5168940Z 2025-05-07T20:32:29.5169036Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5169165Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5171038Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
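By this point the failure mode has shifted: earlier examples OOMed on the large initial torch.randn, but now even 20.00 MiB requests fail at the torch.clamp on moe/activation_test.py:95 with only 4.44 MiB free, which suggests memory accumulating across Hypothesis examples rather than any single oversized tensor. A small illustrative sketch (not part of the test file) for bracketing one example with allocator statistics:

    import torch

    def report_cuda_memory(tag: str) -> None:
        # memory_allocated = bytes held by live tensors; memory_reserved = bytes
        # the caching allocator is holding from the driver (live + cached).
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

    report_cuda_memory("before example")
    # ... run one test_silu_mul_quant example here ...
    torch.cuda.empty_cache()  # hand cached-but-unused blocks back to the driver
    report_cuda_memory("after empty_cache")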
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5171084Z 2025-05-07T20:32:29.5171206Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:29.5171210Z 2025-05-07T20:32:29.5171313Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5171547Z self=, 2025-05-07T20:32:29.5171629Z T=128, 2025-05-07T20:32:29.5171745Z D=7168, 2025-05-07T20:32:29.5171830Z scale_ub=None, 2025-05-07T20:32:29.5171918Z contiguous=True, 2025-05-07T20:32:29.5172039Z compiled=True, 2025-05-07T20:32:29.5172114Z ) 2025-05-07T20:32:29.5172338Z self = 2025-05-07T20:32:29.5172511Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.5172516Z 2025-05-07T20:32:29.5172592Z @given( 2025-05-07T20:32:29.5172710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5172811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5172926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5173045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5173162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5173235Z ) 2025-05-07T20:32:29.5173492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5173589Z def test_silu_mul_quant( 2025-05-07T20:32:29.5173664Z self, 2025-05-07T20:32:29.5173747Z T: int, 2025-05-07T20:32:29.5173825Z D: int, 2025-05-07T20:32:29.5173924Z scale_ub: Optional[float], 2025-05-07T20:32:29.5174017Z contiguous: bool, 2025-05-07T20:32:29.5174102Z compiled: bool, 2025-05-07T20:32:29.5174179Z ) -> None: 2025-05-07T20:32:29.5174278Z torch.manual_seed(2025) 2025-05-07T20:32:29.5174350Z 2025-05-07T20:32:29.5174522Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5176352Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5176363Z 2025-05-07T20:32:29.5176484Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5176623Z =============================== warnings summary =============================== 2025-05-07T20:32:29.5176942Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:29.5177256Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:29.5177565Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:29.5178468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:29.5178757Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:29.5178762Z 2025-05-07T20:32:29.5178980Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:29.5179155Z ================= 1 failed, 1 deselected, 3 warnings in 13.80s ================= 2025-05-07T20:32:31.1966280Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:31.2584029Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:31.2584366Z 2025-05-07T20:32:33.2608856Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:35.4122145Z ============================= test session starts ============================== 2025-05-07T20:32:35.4123263Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:35.4123822Z cachedir: .pytest_cache 2025-05-07T20:32:35.4124604Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:35.4125557Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:35.4125985Z plugins: hypothesis-6.131.14 2025-05-07T20:32:36.9729874Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:37.0699227Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:37.0699811Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:37.0700125Z 2025-05-07T20:32:39.1693238Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.1694273Z self=, 2025-05-07T20:32:39.1694705Z T=1, 2025-05-07T20:32:39.1694911Z D=5120, 2025-05-07T20:32:39.1695121Z scale_ub=None, 2025-05-07T20:32:39.1695343Z contiguous=True, 2025-05-07T20:32:39.1695586Z compiled=True, 2025-05-07T20:32:39.1695807Z ) 2025-05-07T20:32:39.1696144Z self = 2025-05-07T20:32:39.1696658Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:39.1696930Z 2025-05-07T20:32:39.1697021Z @given( 2025-05-07T20:32:39.1697279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.1697604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.1697935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.1698290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.1698637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.1698952Z ) 2025-05-07T20:32:39.1699333Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.1699792Z def test_silu_mul_quant( 2025-05-07T20:32:39.1700052Z self, 2025-05-07T20:32:39.1700264Z T: int, 2025-05-07T20:32:39.1700470Z D: int, 2025-05-07T20:32:39.1700708Z scale_ub: Optional[float], 2025-05-07T20:32:39.1700999Z contiguous: bool, 2025-05-07T20:32:39.1701249Z compiled: bool, 2025-05-07T20:32:39.1701490Z ) -> None: 2025-05-07T20:32:39.1701722Z torch.manual_seed(2025) 2025-05-07T20:32:39.1701973Z 2025-05-07T20:32:39.1702268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.1702630Z 2025-05-07T20:32:39.1702840Z x_sign = torch.sign(x) 2025-05-07T20:32:39.1703145Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:39.1703478Z x = x_sign * x_clamp 2025-05-07T20:32:39.1703735Z x0 = x[:, :D] 2025-05-07T20:32:39.1703960Z x1 = x[:, D:] 2025-05-07T20:32:39.1704508Z 2025-05-07T20:32:39.1704714Z if contiguous: 2025-05-07T20:32:39.1704955Z x0 = x0.contiguous() 2025-05-07T20:32:39.1705232Z x1 = x1.contiguous() 2025-05-07T20:32:39.1705488Z 2025-05-07T20:32:39.1705688Z if scale_ub is not None: 2025-05-07T20:32:39.1705982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.1706338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.1706660Z ) 2025-05-07T20:32:39.1706902Z else: 2025-05-07T20:32:39.1707141Z scale_ub_tensor = None 2025-05-07T20:32:39.1707491Z 2025-05-07T20:32:39.1707743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.1708081Z op = silu_mul_quant 2025-05-07T20:32:39.1708348Z if compiled: 2025-05-07T20:32:39.1708606Z op = torch.compile(op) 2025-05-07T20:32:39.1709031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.1709325Z 2025-05-07T20:32:39.1709610Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.1709917Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.1710228Z 2025-05-07T20:32:39.1710480Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.1710834Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.1711146Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.1711477Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.1711859Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.1712194Z 2025-05-07T20:32:39.1712417Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.1712622Z 2025-05-07T20:32:39.1712729Z moe/activation_test.py:126: 2025-05-07T20:32:39.1713051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1713681Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.1714033Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.1714870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:39.1715661Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.1716245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.1716974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.1717730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.1718494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.1719256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.1719933Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.1720647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.1721189Z fn() 2025-05-07T20:32:39.1721714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.1722325Z self.fn.run( 2025-05-07T20:32:39.1722820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.1723368Z kernel = self.compile( 2025-05-07T20:32:39.1723941Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.1724626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.1725052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1725294Z 2025-05-07T20:32:39.1725586Z self = 2025-05-07T20:32:39.1726712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.1728157Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ac76700>} 2025-05-07T20:32:39.1729543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.1730669Z context = 2025-05-07T20:32:39.1730970Z 2025-05-07T20:32:39.1731144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.1731804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.1732301Z module_map=module_map) 2025-05-07T20:32:39.1732683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.1733070Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.1733358Z E ^ 2025-05-07T20:32:39.1733846Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1734315Z 2025-05-07T20:32:39.1734748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.1735291Z 2025-05-07T20:32:39.1735401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.1735841Z self=, 2025-05-07T20:32:39.1736265Z T=2048, 2025-05-07T20:32:39.1736469Z D=5120, 2025-05-07T20:32:39.1736676Z scale_ub=1200.0, 2025-05-07T20:32:39.1736919Z contiguous=True, 2025-05-07T20:32:39.1737155Z compiled=False, 2025-05-07T20:32:39.1737376Z ) 2025-05-07T20:32:39.1737718Z self = 2025-05-07T20:32:39.1738244Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:39.1738537Z 2025-05-07T20:32:39.1738619Z @given( 2025-05-07T20:32:39.1738871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.1739199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.1739535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.1739888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.1740239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.1740542Z ) 2025-05-07T20:32:39.1740915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.1741383Z def test_silu_mul_quant( 2025-05-07T20:32:39.1741640Z self, 2025-05-07T20:32:39.1741851Z T: int, 2025-05-07T20:32:39.1742103Z D: int, 2025-05-07T20:32:39.1742431Z scale_ub: Optional[float], 2025-05-07T20:32:39.1742827Z contiguous: bool, 2025-05-07T20:32:39.1743181Z compiled: bool, 2025-05-07T20:32:39.1743418Z ) -> None: 2025-05-07T20:32:39.1743648Z torch.manual_seed(2025) 2025-05-07T20:32:39.1743909Z 2025-05-07T20:32:39.1744381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.1744915Z 2025-05-07T20:32:39.1745174Z x_sign = torch.sign(x) 2025-05-07T20:32:39.1745592Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.1745914Z x = x_sign * x_clamp 2025-05-07T20:32:39.1746213Z x0 = x[:, :D] 
2025-05-07T20:32:39.1746443Z x1 = x[:, D:] 2025-05-07T20:32:39.1746674Z 2025-05-07T20:32:39.1746901Z if contiguous: 2025-05-07T20:32:39.1747145Z x0 = x0.contiguous() 2025-05-07T20:32:39.1747481Z x1 = x1.contiguous() 2025-05-07T20:32:39.1747738Z 2025-05-07T20:32:39.1747943Z if scale_ub is not None: 2025-05-07T20:32:39.1748225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.1748587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.1748916Z ) 2025-05-07T20:32:39.1749124Z else: 2025-05-07T20:32:39.1749345Z scale_ub_tensor = None 2025-05-07T20:32:39.1749620Z 2025-05-07T20:32:39.1749868Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.1750243Z op = silu_mul_quant 2025-05-07T20:32:39.1750515Z if compiled: 2025-05-07T20:32:39.1750780Z op = torch.compile(op) 2025-05-07T20:32:39.1751091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.1751385Z 2025-05-07T20:32:39.1751593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.1751810Z 2025-05-07T20:32:39.1751923Z moe/activation_test.py:117: 2025-05-07T20:32:39.1752279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1752632Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.1752929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.1753641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.1754360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.1754922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.1755636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.1756324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.1756876Z kernel = self.compile( 2025-05-07T20:32:39.1757446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.1758126Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.1758548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1758793Z 2025-05-07T20:32:39.1759006Z self = 2025-05-07T20:32:39.1760238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.1761671Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ab2a020>} 2025-05-07T20:32:39.1763059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.1764115Z context = 2025-05-07T20:32:39.1764418Z 2025-05-07T20:32:39.1764592Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.1765136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.1765618Z module_map=module_map) 2025-05-07T20:32:39.1765998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.1766375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.1766667Z E ^ 2025-05-07T20:32:39.1767175Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1767648Z 2025-05-07T20:32:39.1768133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8306922Z 2025-05-07T20:32:39.8307381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.8308042Z self=, 2025-05-07T20:32:39.8308643Z T=2048, 2025-05-07T20:32:39.8308906Z D=5120, 2025-05-07T20:32:39.8309210Z scale_ub=1200.0, 2025-05-07T20:32:39.8309515Z contiguous=True, 2025-05-07T20:32:39.8309817Z compiled=True, 2025-05-07T20:32:39.8310074Z ) 2025-05-07T20:32:39.8310416Z self = 2025-05-07T20:32:39.8311239Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:39.8311526Z 2025-05-07T20:32:39.8311610Z @given( 2025-05-07T20:32:39.8311862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.8312197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.8312663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.8313027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.8313766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.8314081Z ) 2025-05-07T20:32:39.8314450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.8314924Z def test_silu_mul_quant( 2025-05-07T20:32:39.8315189Z self, 2025-05-07T20:32:39.8315393Z T: int, 2025-05-07T20:32:39.8315606Z D: int, 2025-05-07T20:32:39.8315841Z scale_ub: Optional[float], 2025-05-07T20:32:39.8316126Z contiguous: bool, 2025-05-07T20:32:39.8316386Z compiled: bool, 2025-05-07T20:32:39.8316630Z ) -> None: 2025-05-07T20:32:39.8316855Z torch.manual_seed(2025) 2025-05-07T20:32:39.8317114Z 2025-05-07T20:32:39.8317408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8317763Z 2025-05-07T20:32:39.8317976Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8318295Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8318627Z x = x_sign * x_clamp 2025-05-07T20:32:39.8318880Z x0 = x[:, :D] 2025-05-07T20:32:39.8319113Z x1 = x[:, D:] 2025-05-07T20:32:39.8319336Z 2025-05-07T20:32:39.8319532Z if contiguous: 2025-05-07T20:32:39.8319782Z x0 = x0.contiguous() 2025-05-07T20:32:39.8320059Z x1 = x1.contiguous() 2025-05-07T20:32:39.8320436Z 2025-05-07T20:32:39.8320644Z if scale_ub is not None: 2025-05-07T20:32:39.8320940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8321300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8321637Z ) 2025-05-07T20:32:39.8321848Z else: 2025-05-07T20:32:39.8322069Z scale_ub_tensor = None 2025-05-07T20:32:39.8322344Z 2025-05-07T20:32:39.8322595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8322928Z op = silu_mul_quant 2025-05-07T20:32:39.8323201Z if compiled: 2025-05-07T20:32:39.8323471Z op = torch.compile(op) 2025-05-07T20:32:39.8323785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8324081Z 2025-05-07T20:32:39.8324289Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.8324598Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.8324906Z 2025-05-07T20:32:39.8325162Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8325521Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.8325833Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.8326175Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.8326567Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.8326941Z 2025-05-07T20:32:39.8327165Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.8327373Z 2025-05-07T20:32:39.8327490Z moe/activation_test.py:126: 2025-05-07T20:32:39.8327903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8328268Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.8328621Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.8329448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:39.8330228Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.8330811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8331591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8332319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.8333077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.8334472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.8335156Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.8335894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.8336525Z fn() 2025-05-07T20:32:39.8337144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.8337859Z self.fn.run( 2025-05-07T20:32:39.8338420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8339069Z kernel = self.compile( 2025-05-07T20:32:39.8339725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8340527Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8341005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8341287Z 2025-05-07T20:32:39.8341527Z self = 2025-05-07T20:32:39.8342657Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8344104Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ab34720>} 2025-05-07T20:32:39.8345500Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8346574Z context = 2025-05-07T20:32:39.8346883Z 2025-05-07T20:32:39.8347065Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8347622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8348118Z module_map=module_map) 2025-05-07T20:32:39.8348515Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8348900Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.8349184Z E ^ 2025-05-07T20:32:39.8349679Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.8350163Z 2025-05-07T20:32:39.8350601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8351140Z 2025-05-07T20:32:39.8351261Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.8351756Z self=, 2025-05-07T20:32:39.8352197Z T=16384, 2025-05-07T20:32:39.8352413Z D=7168, 2025-05-07T20:32:39.8352619Z scale_ub=1200.0, 2025-05-07T20:32:39.8352866Z contiguous=False, 2025-05-07T20:32:39.8353116Z compiled=False, 2025-05-07T20:32:39.8353334Z ) 2025-05-07T20:32:39.8353678Z self = 2025-05-07T20:32:39.8354218Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.8354518Z 2025-05-07T20:32:39.8354654Z @given( 2025-05-07T20:32:39.8354899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.8355239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.8355576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.8355926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.8356324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.8356640Z ) 2025-05-07T20:32:39.8357095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.8357569Z def test_silu_mul_quant( 2025-05-07T20:32:39.8357831Z self, 2025-05-07T20:32:39.8358043Z T: int, 2025-05-07T20:32:39.8358255Z D: int, 2025-05-07T20:32:39.8358494Z scale_ub: Optional[float], 2025-05-07T20:32:39.8358789Z contiguous: bool, 2025-05-07T20:32:39.8359045Z compiled: bool, 2025-05-07T20:32:39.8359285Z ) -> None: 2025-05-07T20:32:39.8359518Z torch.manual_seed(2025) 2025-05-07T20:32:39.8359777Z 2025-05-07T20:32:39.8360068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8360512Z 2025-05-07T20:32:39.8360720Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8361035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8361375Z x = x_sign * x_clamp 2025-05-07T20:32:39.8361630Z x0 = x[:, :D] 2025-05-07T20:32:39.8361871Z x1 = x[:, D:] 2025-05-07T20:32:39.8362102Z 2025-05-07T20:32:39.8362301Z if contiguous: 2025-05-07T20:32:39.8362555Z x0 = x0.contiguous() 2025-05-07T20:32:39.8362841Z x1 = x1.contiguous() 2025-05-07T20:32:39.8363097Z 2025-05-07T20:32:39.8363307Z if scale_ub is not None: 2025-05-07T20:32:39.8363605Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8363969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8364297Z ) 2025-05-07T20:32:39.8364512Z else: 2025-05-07T20:32:39.8364745Z scale_ub_tensor = None 2025-05-07T20:32:39.8365012Z 2025-05-07T20:32:39.8365267Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8365608Z op = silu_mul_quant 2025-05-07T20:32:39.8365875Z if compiled: 2025-05-07T20:32:39.8366147Z op = torch.compile(op) 2025-05-07T20:32:39.8366477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8366776Z 2025-05-07T20:32:39.8366992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.8367169Z 2025-05-07T20:32:39.8367286Z moe/activation_test.py:117: 2025-05-07T20:32:39.8367601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8367962Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.8368270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8369004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.8369728Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.8370303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8371029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8371787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8372352Z kernel = self.compile( 2025-05-07T20:32:39.8372931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8373628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8374048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8374293Z 2025-05-07T20:32:39.8374513Z self = 2025-05-07T20:32:39.8375695Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8377134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0899a13880>} 2025-05-07T20:32:39.8378623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8379692Z context = 2025-05-07T20:32:39.8380003Z 2025-05-07T20:32:39.8380182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8380741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8381243Z module_map=module_map) 2025-05-07T20:32:39.8381627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8382006Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.8382286Z E ^ 2025-05-07T20:32:39.8382780Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    ... (test source identical to the previous example, through the definition of fn) ...

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
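In this compiled=True example the fn() call itself returned and the first kernel actually built was the reference path's, so the same architecture error surfaces inside triton_quantize_fp8_row instead. The row-wise quantization that kernel performs can be sketched in plain PyTorch roughly as below; these are assumed semantics inferred from how the test consumes the result (y_fp8 dequantized by a per-row y_scale), not FBGEMM's actual implementation:

```python
# A minimal eager sketch (assumed semantics) of row-wise fp8 quantization:
# one scale per row from the row's max magnitude, optionally capped by
# scale_ub, then a cast to torch.float8_e4m3fn. Names are illustrative.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub is a [1] tensor
    scale = FP8_MAX / row_max.clamp(min=1e-12)  # guard all-zero rows
    y_fp8 = (y.to(torch.float32) * scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale.reciprocal()  # (quantized rows, per-row dequant scale)
```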
The remaining tried examples in this excerpt fail the same way: every compiled=False run dies compiling the fused kernel _fbgemm_silu_mul_quant, and every compiled=True run gets as far as the reference path before dying in _kernel_quantize_fp8_row. Condensed:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant (identical traceback)
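For reference, the activation half of the fused kernel is exactly what ref_fn spells out eagerly: SiLU(x0) = x0 * sigmoid(x0), multiplied elementwise by x1 in float32, before row-wise quantization. A self-contained sketch; the function name is hypothetical:

```python
# Eager equivalent (sketch) of the activation math _fbgemm_silu_mul_quant
# fuses with quantization, mirroring ref_fn in the test above.
import torch


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
```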
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (via ref_fn -> triton_quantize_fp8_row, identical traceback)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant (identical traceback)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant (identical traceback)

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)
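The autotuner frames in the compiled=True tracebacks above also show that the error fires while lowering the kernel AST to TTIR (src.make_ir -> ast_to_ttir) for the very first tuning config, before any benchmarking runs. Any kernel that merely materializes the type reproduces it, as in this deliberately trivial, hypothetical kernel:

```python
# Hypothetical minimal reproducer: casting to tl.float8e4nv is rejected at
# compile time on pre-sm_89 GPUs, matching the ValueError in this log.
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)  # fails here on sm_86


def try_cast(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    grid = (triton.cdiv(x.numel(), 1024),)
    _cast_fp8_kernel[grid](x, y, x.numel(), BLOCK=1024)
    return y
```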
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> (log excerpt truncated mid-example)
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.9781864Z 2025-05-07T20:32:42.9781968Z moe/activation_test.py:126: 2025-05-07T20:32:42.9782282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9782630Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.9782960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.9783769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.9784552Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.9785112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9785816Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9786520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.9787263Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.9788017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.9788724Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.9789353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.9789899Z fn() 2025-05-07T20:32:42.9790477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.9791091Z self.fn.run( 2025-05-07T20:32:42.9791578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9792122Z kernel = self.compile( 2025-05-07T20:32:42.9792684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9793363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9793819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9794055Z 2025-05-07T20:32:42.9794267Z self = 2025-05-07T20:32:42.9795424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9796887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb34900>} 2025-05-07T20:32:42.9798566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9799771Z context = 2025-05-07T20:32:42.9800162Z 2025-05-07T20:32:42.9800338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9800886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9801375Z module_map=module_map) 2025-05-07T20:32:42.9801756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9802134Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.9802415Z E ^ 2025-05-07T20:32:42.9802892Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9803365Z 2025-05-07T20:32:42.9803799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9993632Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:42.9995118Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:42.9996612Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:42.9997642Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:42.9998860Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:43.3976275Z 2025-05-07T20:32:43.3976651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3977301Z self=, 2025-05-07T20:32:43.3977938Z T=1, 2025-05-07T20:32:43.3978206Z D=5120, 2025-05-07T20:32:43.3978494Z scale_ub=1200.0, 2025-05-07T20:32:43.3978757Z contiguous=True, 2025-05-07T20:32:43.3978996Z compiled=True, 2025-05-07T20:32:43.3979219Z ) 2025-05-07T20:32:43.3979678Z self = 2025-05-07T20:32:43.3980205Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.3980482Z 2025-05-07T20:32:43.3980573Z @given( 2025-05-07T20:32:43.3980818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3981156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3981487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3981847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3982194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3982565Z ) 2025-05-07T20:32:43.3982941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3983406Z def test_silu_mul_quant( 2025-05-07T20:32:43.3983667Z self, 2025-05-07T20:32:43.3983879Z T: int, 2025-05-07T20:32:43.3984087Z D: int, 2025-05-07T20:32:43.3984388Z scale_ub: Optional[float], 2025-05-07T20:32:43.3984678Z contiguous: bool, 2025-05-07T20:32:43.3984987Z compiled: bool, 2025-05-07T20:32:43.3985229Z ) -> None: 2025-05-07T20:32:43.3985458Z torch.manual_seed(2025) 2025-05-07T20:32:43.3985709Z 2025-05-07T20:32:43.3986003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3986369Z 2025-05-07T20:32:43.3986569Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3986883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3987215Z x = x_sign * x_clamp 2025-05-07T20:32:43.3987472Z x0 = x[:, :D] 2025-05-07T20:32:43.3987699Z x1 = x[:, D:] 2025-05-07T20:32:43.3987921Z 2025-05-07T20:32:43.3988121Z if contiguous: 2025-05-07T20:32:43.3988366Z x0 = x0.contiguous() 2025-05-07T20:32:43.3988644Z x1 = x1.contiguous() 2025-05-07T20:32:43.3988898Z 2025-05-07T20:32:43.3989096Z if scale_ub is not None: 2025-05-07T20:32:43.3989395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.3989756Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:43.3990079Z ) 2025-05-07T20:32:43.3990286Z else: 2025-05-07T20:32:43.3990511Z scale_ub_tensor = None 2025-05-07T20:32:43.3990773Z 2025-05-07T20:32:43.3991023Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.3991360Z op = silu_mul_quant 2025-05-07T20:32:43.3991621Z if compiled: 2025-05-07T20:32:43.3991886Z op = torch.compile(op) 2025-05-07T20:32:43.3992205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3992505Z 2025-05-07T20:32:43.3992709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.3992891Z 2025-05-07T20:32:43.3993001Z moe/activation_test.py:117: 2025-05-07T20:32:43.3993318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3993675Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.3993982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3994580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.3995171Z return fn(*args, **kwargs) 2025-05-07T20:32:43.3995876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.3996600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.3997173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.3997927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.3998649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.3999213Z kernel = self.compile( 2025-05-07T20:32:43.3999887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4000678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4001107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4001352Z 2025-05-07T20:32:43.4001576Z self = 2025-05-07T20:32:43.4002704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4004196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73cd60>} 2025-05-07T20:32:43.4005647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4006761Z context = 2025-05-07T20:32:43.4007065Z 2025-05-07T20:32:43.4007251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4007801Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4008301Z module_map=module_map) 2025-05-07T20:32:43.4008689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4009063Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4009340Z E ^ 2025-05-07T20:32:43.4009830Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4010303Z 2025-05-07T20:32:43.4010747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4011293Z 2025-05-07T20:32:43.4011408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.4011854Z self=, 2025-05-07T20:32:43.4012283Z T=1, 2025-05-07T20:32:43.4012475Z D=5120, 2025-05-07T20:32:43.4012682Z scale_ub=None, 2025-05-07T20:32:43.4012911Z contiguous=False, 2025-05-07T20:32:43.4013157Z compiled=True, 2025-05-07T20:32:43.4013557Z ) 2025-05-07T20:32:43.4013901Z self = 2025-05-07T20:32:43.4014425Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.4014702Z 2025-05-07T20:32:43.4014788Z @given( 2025-05-07T20:32:43.4015039Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.4015377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.4015698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.4022346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.4022750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.4023051Z ) 2025-05-07T20:32:43.4023429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.4023908Z def test_silu_mul_quant( 2025-05-07T20:32:43.4024171Z self, 2025-05-07T20:32:43.4024373Z T: int, 2025-05-07T20:32:43.4024585Z D: int, 2025-05-07T20:32:43.4024818Z scale_ub: Optional[float], 2025-05-07T20:32:43.4025105Z contiguous: bool, 2025-05-07T20:32:43.4025368Z compiled: bool, 2025-05-07T20:32:43.4025610Z ) -> None: 2025-05-07T20:32:43.4025834Z torch.manual_seed(2025) 2025-05-07T20:32:43.4026093Z 2025-05-07T20:32:43.4026382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.4026739Z 2025-05-07T20:32:43.4026947Z x_sign = torch.sign(x) 2025-05-07T20:32:43.4027368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.4027715Z x = x_sign * x_clamp 2025-05-07T20:32:43.4028007Z x0 = x[:, :D] 2025-05-07T20:32:43.4028242Z x1 = x[:, D:] 2025-05-07T20:32:43.4028456Z 2025-05-07T20:32:43.4028653Z if contiguous: 2025-05-07T20:32:43.4028898Z x0 = x0.contiguous() 2025-05-07T20:32:43.4029164Z x1 = x1.contiguous() 2025-05-07T20:32:43.4029420Z 2025-05-07T20:32:43.4029624Z if scale_ub is not None: 2025-05-07T20:32:43.4029913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.4030265Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.4030668Z ) 2025-05-07T20:32:43.4030877Z else: 2025-05-07T20:32:43.4031097Z scale_ub_tensor = None 2025-05-07T20:32:43.4031364Z 2025-05-07T20:32:43.4031616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4032009Z op = silu_mul_quant 2025-05-07T20:32:43.4032275Z if compiled: 2025-05-07T20:32:43.4032597Z op = torch.compile(op) 2025-05-07T20:32:43.4032908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4033202Z 2025-05-07T20:32:43.4033408Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.4033705Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.4034011Z 2025-05-07T20:32:43.4034264Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4034620Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.4034926Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.4035262Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.4035641Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.4035965Z 2025-05-07T20:32:43.4036184Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.4036390Z 2025-05-07T20:32:43.4036512Z moe/activation_test.py:126: 2025-05-07T20:32:43.4036824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4037189Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.4037541Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.4038390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.4039216Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.4039796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.4040616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.4041342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.4042109Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.4042888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.4043565Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.4044202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.4044747Z fn() 2025-05-07T20:32:43.4045285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.4045900Z self.fn.run( 2025-05-07T20:32:43.4046395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.4046959Z kernel = self.compile( 2025-05-07T20:32:43.4047532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4048220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4048753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4049003Z 2025-05-07T20:32:43.4049223Z self = 2025-05-07T20:32:43.4050356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4051791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad732de0>} 2025-05-07T20:32:43.4053242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4054362Z context = 2025-05-07T20:32:43.4054673Z 2025-05-07T20:32:43.4054920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4055477Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4055963Z module_map=module_map) 2025-05-07T20:32:43.4056351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4056731Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.4057006Z E ^ 2025-05-07T20:32:43.4057495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4057966Z 2025-05-07T20:32:43.4058409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5462829Z 2025-05-07T20:32:43.5463980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5465358Z self=, 2025-05-07T20:32:43.5466481Z T=1, 2025-05-07T20:32:43.5466874Z D=5120, 2025-05-07T20:32:43.5467288Z scale_ub=None, 2025-05-07T20:32:43.5467742Z contiguous=True, 2025-05-07T20:32:43.5468021Z compiled=False, 2025-05-07T20:32:43.5468277Z ) 2025-05-07T20:32:43.5468625Z self = 2025-05-07T20:32:43.5469141Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.5469425Z 2025-05-07T20:32:43.5469513Z @given( 2025-05-07T20:32:43.5469778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5470113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5470449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5470808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5471168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5471476Z ) 2025-05-07T20:32:43.5471868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5472344Z def test_silu_mul_quant( 2025-05-07T20:32:43.5472603Z self, 2025-05-07T20:32:43.5472826Z T: int, 2025-05-07T20:32:43.5473045Z D: int, 2025-05-07T20:32:43.5473279Z scale_ub: Optional[float], 2025-05-07T20:32:43.5473582Z contiguous: bool, 2025-05-07T20:32:43.5473846Z compiled: bool, 2025-05-07T20:32:43.5474089Z ) -> None: 2025-05-07T20:32:43.5474330Z torch.manual_seed(2025) 2025-05-07T20:32:43.5474598Z 2025-05-07T20:32:43.5474891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5475264Z 2025-05-07T20:32:43.5475480Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5475790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5476131Z x = x_sign * x_clamp 2025-05-07T20:32:43.5476398Z x0 = x[:, :D] 2025-05-07T20:32:43.5476638Z x1 = x[:, D:] 2025-05-07T20:32:43.5477171Z 2025-05-07T20:32:43.5477382Z if contiguous: 2025-05-07T20:32:43.5477641Z x0 = x0.contiguous() 2025-05-07T20:32:43.5477915Z x1 = x1.contiguous() 2025-05-07T20:32:43.5478184Z 2025-05-07T20:32:43.5478398Z if scale_ub is not None: 2025-05-07T20:32:43.5478691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5479059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5479400Z ) 2025-05-07T20:32:43.5479606Z else: 2025-05-07T20:32:43.5479966Z scale_ub_tensor = None 2025-05-07T20:32:43.5480352Z 2025-05-07T20:32:43.5480600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5480948Z op = silu_mul_quant 2025-05-07T20:32:43.5481222Z if compiled: 2025-05-07T20:32:43.5481484Z op = torch.compile(op) 2025-05-07T20:32:43.5481900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5482204Z 2025-05-07T20:32:43.5482487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5482678Z 2025-05-07T20:32:43.5482789Z moe/activation_test.py:117: 2025-05-07T20:32:43.5483112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5483473Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5483772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5484508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5485247Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5485814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5486545Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5487246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5487826Z kernel = self.compile( 2025-05-07T20:32:43.5488400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5489099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5489530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5489809Z 2025-05-07T20:32:43.5490029Z self = 2025-05-07T20:32:43.5491164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5492623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898019b20>} 2025-05-07T20:32:43.5494020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5495098Z context = 2025-05-07T20:32:43.5495415Z 2025-05-07T20:32:43.5495595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5496155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5496648Z module_map=module_map) 2025-05-07T20:32:43.5497046Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5497429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5497712Z E ^ 2025-05-07T20:32:43.5498202Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5498735Z 2025-05-07T20:32:43.5499176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5499712Z 2025-05-07T20:32:43.5499832Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5500275Z self=, 2025-05-07T20:32:43.5500699Z T=128, 2025-05-07T20:32:43.5500906Z D=5120, 2025-05-07T20:32:43.5501118Z scale_ub=None, 2025-05-07T20:32:43.5501348Z contiguous=False, 2025-05-07T20:32:43.5501596Z compiled=True, 2025-05-07T20:32:43.5501862Z ) 2025-05-07T20:32:43.5502193Z self = 2025-05-07T20:32:43.5502711Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.5502991Z 2025-05-07T20:32:43.5503080Z @given( 2025-05-07T20:32:43.5503359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5503696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5504063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5504418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5504762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5505068Z ) 2025-05-07T20:32:43.5505442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5505901Z def test_silu_mul_quant( 2025-05-07T20:32:43.5506159Z self, 2025-05-07T20:32:43.5506370Z T: int, 2025-05-07T20:32:43.5506579Z D: int, 2025-05-07T20:32:43.5506816Z scale_ub: Optional[float], 2025-05-07T20:32:43.5507110Z contiguous: bool, 2025-05-07T20:32:43.5507384Z compiled: bool, 2025-05-07T20:32:43.5507616Z ) -> None: 2025-05-07T20:32:43.5507849Z torch.manual_seed(2025) 2025-05-07T20:32:43.5508110Z 2025-05-07T20:32:43.5508402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5508769Z 2025-05-07T20:32:43.5508986Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5509299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5509624Z x = x_sign * x_clamp 2025-05-07T20:32:43.5509884Z x0 = x[:, :D] 2025-05-07T20:32:43.5510118Z x1 = x[:, D:] 2025-05-07T20:32:43.5510334Z 2025-05-07T20:32:43.5510537Z if contiguous: 2025-05-07T20:32:43.5510787Z x0 = x0.contiguous() 2025-05-07T20:32:43.5511058Z x1 = x1.contiguous() 2025-05-07T20:32:43.5511315Z 2025-05-07T20:32:43.5511525Z if scale_ub is not None: 2025-05-07T20:32:43.5511809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5512170Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5512498Z ) 2025-05-07T20:32:43.5512697Z else: 2025-05-07T20:32:43.5512928Z scale_ub_tensor = None 2025-05-07T20:32:43.5513195Z 2025-05-07T20:32:43.5513753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5514092Z op = silu_mul_quant 2025-05-07T20:32:43.5514362Z if compiled: 2025-05-07T20:32:43.5514625Z op = torch.compile(op) 2025-05-07T20:32:43.5514935Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5515229Z 2025-05-07T20:32:43.5515439Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5515613Z 2025-05-07T20:32:43.5515719Z moe/activation_test.py:117: 2025-05-07T20:32:43.5516035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5516396Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5516693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5517286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5517876Z return fn(*args, **kwargs) 
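Aside on the repeated CompilationError above: Triton's fp8e4nv is the e4m3 float8 format, and converting to it requires compute capability 8.9 or newer (Ada/Hopper). The g5 runner's A10G GPU is sm_86, where Triton exposes only fp8e4b15 and fp8e5, hence every fp8 kernel compile in this job fails the same way. A minimal guard one could use to skip fp8 paths on such GPUs, sketched here with a hypothetical helper name that is not from this codebase:

import torch

def supports_fp8e4nv() -> bool:
    # get_device_capability() returns e.g. (8, 6) for A10G or (9, 0) for H100;
    # Triton's fp8e4nv (e4m3) conversions need at least (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
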
2025-05-07T20:32:43.5518662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5519381Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5519951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5520801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5521491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5522054Z kernel = self.compile( 2025-05-07T20:32:43.5522693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5523387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5523804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5524118Z 2025-05-07T20:32:43.5524340Z self = 2025-05-07T20:32:43.5525524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5526965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad733a60>} 2025-05-07T20:32:43.5528358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5529427Z context = 2025-05-07T20:32:43.5529743Z 2025-05-07T20:32:43.5529924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5530482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5530970Z module_map=module_map) 2025-05-07T20:32:43.5531362Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5531741Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5532025Z E ^ 2025-05-07T20:32:43.5532509Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5532989Z 2025-05-07T20:32:43.5533431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5533966Z 2025-05-07T20:32:43.5534087Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5534528Z self=, 2025-05-07T20:32:43.5534952Z T=128, 2025-05-07T20:32:43.5535158Z D=7168, 2025-05-07T20:32:43.5535375Z scale_ub=1200.0, 2025-05-07T20:32:43.5535617Z contiguous=False, 2025-05-07T20:32:43.5535863Z compiled=False, 2025-05-07T20:32:43.7097410Z ) 2025-05-07T20:32:43.7098412Z self = 2025-05-07T20:32:43.7100051Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.7100878Z 2025-05-07T20:32:43.7101059Z @given( 2025-05-07T20:32:43.7101540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7102177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7102839Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7103524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7104193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7104789Z ) 2025-05-07T20:32:43.7105520Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7106714Z def test_silu_mul_quant( 2025-05-07T20:32:43.7107212Z self, 2025-05-07T20:32:43.7107638Z T: int, 2025-05-07T20:32:43.7108094Z D: int, 2025-05-07T20:32:43.7108421Z scale_ub: Optional[float], 2025-05-07T20:32:43.7108786Z contiguous: bool, 2025-05-07T20:32:43.7109113Z compiled: bool, 2025-05-07T20:32:43.7109409Z ) -> None: 2025-05-07T20:32:43.7109663Z torch.manual_seed(2025) 2025-05-07T20:32:43.7109923Z 2025-05-07T20:32:43.7110208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7110575Z 2025-05-07T20:32:43.7110878Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7111180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7111515Z x = x_sign * x_clamp 2025-05-07T20:32:43.7111772Z x0 = x[:, :D] 2025-05-07T20:32:43.7111998Z x1 = x[:, D:] 2025-05-07T20:32:43.7112224Z 2025-05-07T20:32:43.7112504Z if contiguous: 2025-05-07T20:32:43.7112748Z x0 = x0.contiguous() 2025-05-07T20:32:43.7113143Z x1 = x1.contiguous() 2025-05-07T20:32:43.7113707Z 2025-05-07T20:32:43.7113917Z if scale_ub is not None: 2025-05-07T20:32:43.7114200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7114557Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7114891Z ) 2025-05-07T20:32:43.7115094Z else: 2025-05-07T20:32:43.7115321Z scale_ub_tensor = None 2025-05-07T20:32:43.7115589Z 2025-05-07T20:32:43.7115831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7116173Z op = silu_mul_quant 2025-05-07T20:32:43.7116441Z if compiled: 2025-05-07T20:32:43.7116701Z op = torch.compile(op) 2025-05-07T20:32:43.7117018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7117313Z 2025-05-07T20:32:43.7117519Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7117700Z 2025-05-07T20:32:43.7117811Z moe/activation_test.py:117: 2025-05-07T20:32:43.7118128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7118483Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7118778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7119510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7120337Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7120899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7121621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7122321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7122884Z kernel = self.compile( 2025-05-07T20:32:43.7123459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7124152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7124575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7124862Z 2025-05-07T20:32:43.7125080Z self = 2025-05-07T20:32:43.7126214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7127678Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb3a2a0>} 2025-05-07T20:32:43.7129183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7130266Z context = 2025-05-07T20:32:43.7130579Z 2025-05-07T20:32:43.7130755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7131308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7131795Z module_map=module_map) 2025-05-07T20:32:43.7132183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7132625Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7132907Z E ^ 2025-05-07T20:32:43.7133389Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7133867Z 2025-05-07T20:32:43.7134366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7134958Z 2025-05-07T20:32:43.7135079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7135512Z self=, 2025-05-07T20:32:43.7135940Z T=128, 2025-05-07T20:32:43.7136146Z D=5120, 2025-05-07T20:32:43.7136358Z scale_ub=None, 2025-05-07T20:32:43.7136581Z contiguous=False, 2025-05-07T20:32:43.7136826Z compiled=False, 2025-05-07T20:32:43.7137047Z ) 2025-05-07T20:32:43.7137379Z self = 2025-05-07T20:32:43.7137908Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.7138192Z 2025-05-07T20:32:43.7138285Z @given( 2025-05-07T20:32:43.7138526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7138862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7139195Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7139544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7139904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7140214Z ) 2025-05-07T20:32:43.7140592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7141056Z def test_silu_mul_quant( 2025-05-07T20:32:43.7141319Z self, 2025-05-07T20:32:43.7141533Z T: int, 2025-05-07T20:32:43.7141744Z D: int, 2025-05-07T20:32:43.7141981Z scale_ub: Optional[float], 2025-05-07T20:32:43.7142273Z contiguous: bool, 2025-05-07T20:32:43.7142529Z compiled: bool, 2025-05-07T20:32:43.7142773Z ) -> None: 2025-05-07T20:32:43.7143009Z torch.manual_seed(2025) 2025-05-07T20:32:43.7143265Z 2025-05-07T20:32:43.7143560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7143931Z 2025-05-07T20:32:43.7144135Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7144452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7144791Z x = x_sign * x_clamp 2025-05-07T20:32:43.7145050Z x0 = x[:, :D] 2025-05-07T20:32:43.7145278Z x1 = x[:, D:] 2025-05-07T20:32:43.7145504Z 2025-05-07T20:32:43.7145709Z if contiguous: 2025-05-07T20:32:43.7145950Z x0 = x0.contiguous() 2025-05-07T20:32:43.7146231Z x1 = x1.contiguous() 2025-05-07T20:32:43.7146492Z 2025-05-07T20:32:43.7146710Z if scale_ub is not None: 2025-05-07T20:32:43.7153913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7154299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7154632Z ) 2025-05-07T20:32:43.7154850Z else: 2025-05-07T20:32:43.7155086Z scale_ub_tensor = None 2025-05-07T20:32:43.7155351Z 2025-05-07T20:32:43.7155610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7155964Z op = silu_mul_quant 2025-05-07T20:32:43.7156327Z if compiled: 2025-05-07T20:32:43.7156594Z op = torch.compile(op) 2025-05-07T20:32:43.7156919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7157217Z 2025-05-07T20:32:43.7157422Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7157609Z 2025-05-07T20:32:43.7157719Z moe/activation_test.py:117: 2025-05-07T20:32:43.7158043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7158398Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7158710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7159494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7160308Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7160881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7161702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7162418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7162979Z kernel = self.compile( 2025-05-07T20:32:43.7163559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7164268Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7164702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7164951Z 2025-05-07T20:32:43.7165173Z self = 2025-05-07T20:32:43.7166316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7167767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73c720>} 2025-05-07T20:32:43.7169170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7170246Z context = 2025-05-07T20:32:43.7170553Z 2025-05-07T20:32:43.7170735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7171297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7171799Z module_map=module_map) 2025-05-07T20:32:43.7172186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7172576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7172860Z E ^ 2025-05-07T20:32:43.7173349Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7173832Z 2025-05-07T20:32:43.7174270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7174817Z 2025-05-07T20:32:43.7174929Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7175375Z self=, 2025-05-07T20:32:43.7175801Z T=128, 2025-05-07T20:32:43.7176007Z D=5120, 2025-05-07T20:32:43.7176223Z scale_ub=1200.0, 2025-05-07T20:32:43.7176462Z contiguous=True, 2025-05-07T20:32:43.7176705Z compiled=False, 2025-05-07T20:32:43.7176935Z ) 2025-05-07T20:32:43.7177282Z self = 2025-05-07T20:32:43.7177855Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.7178157Z 2025-05-07T20:32:43.7178243Z @given( 2025-05-07T20:32:43.7178497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7178827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7179164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7179517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7179867Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7180177Z ) 2025-05-07T20:32:43.7180554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7181070Z def test_silu_mul_quant( 2025-05-07T20:32:43.7181324Z self, 2025-05-07T20:32:43.7181539Z T: int, 2025-05-07T20:32:43.7181757Z D: int, 2025-05-07T20:32:43.7181989Z scale_ub: Optional[float], 2025-05-07T20:32:43.7182329Z contiguous: bool, 2025-05-07T20:32:43.7182594Z compiled: bool, 2025-05-07T20:32:43.7182837Z ) -> None: 2025-05-07T20:32:43.7183114Z torch.manual_seed(2025) 2025-05-07T20:32:43.7183384Z 2025-05-07T20:32:43.7183672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7184043Z 2025-05-07T20:32:43.7184261Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7184569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7184909Z x = x_sign * x_clamp 2025-05-07T20:32:43.7185171Z x0 = x[:, :D] 2025-05-07T20:32:43.7185398Z x1 = x[:, D:] 2025-05-07T20:32:43.7185630Z 2025-05-07T20:32:43.7185839Z if contiguous: 2025-05-07T20:32:43.7186083Z x0 = x0.contiguous() 2025-05-07T20:32:43.7186368Z x1 = x1.contiguous() 2025-05-07T20:32:43.7186630Z 2025-05-07T20:32:43.7186833Z if scale_ub is not None: 2025-05-07T20:32:43.7187131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7187499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7187828Z ) 2025-05-07T20:32:43.7188030Z else: 2025-05-07T20:32:43.7188254Z scale_ub_tensor = None 2025-05-07T20:32:43.7188520Z 2025-05-07T20:32:43.7188763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7189096Z op = silu_mul_quant 2025-05-07T20:32:43.7189361Z if compiled: 2025-05-07T20:32:43.7189619Z op = torch.compile(op) 2025-05-07T20:32:43.7189932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7190221Z 2025-05-07T20:32:43.7190420Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7190601Z 2025-05-07T20:32:43.7190706Z moe/activation_test.py:117: 2025-05-07T20:32:43.7191015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7191367Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7191657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7192385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7193104Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7193659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7194372Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7195064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7195625Z kernel = self.compile( 2025-05-07T20:32:43.7196181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7196866Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7197284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7197523Z 2025-05-07T20:32:43.7197802Z self = 2025-05-07T20:32:43.7198914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7200425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f8c20>} 2025-05-07T20:32:43.7201820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7202931Z context = 2025-05-07T20:32:43.7203232Z 2025-05-07T20:32:43.7203444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7204585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7205083Z module_map=module_map) 2025-05-07T20:32:43.7205467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7205833Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7206111Z E ^ 2025-05-07T20:32:43.7206599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7207063Z 2025-05-07T20:32:43.7207497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8731580Z 2025-05-07T20:32:43.8731958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8732627Z self=, 2025-05-07T20:32:43.8733269Z T=1, 2025-05-07T20:32:43.8733531Z D=7168, 2025-05-07T20:32:43.8733807Z scale_ub=1200.0, 2025-05-07T20:32:43.8734058Z contiguous=True, 2025-05-07T20:32:43.8734292Z compiled=True, 2025-05-07T20:32:43.8734517Z ) 2025-05-07T20:32:43.8734861Z self = 2025-05-07T20:32:43.8735380Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.8735657Z 2025-05-07T20:32:43.8735739Z @given( 2025-05-07T20:32:43.8735986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8736320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8736649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8737004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8737357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8737655Z ) 2025-05-07T20:32:43.8738031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8738512Z def test_silu_mul_quant( 2025-05-07T20:32:43.8738774Z self, 2025-05-07T20:32:43.8738976Z T: int, 2025-05-07T20:32:43.8739189Z D: int, 2025-05-07T20:32:43.8739425Z scale_ub: Optional[float], 2025-05-07T20:32:43.8739710Z contiguous: bool, 2025-05-07T20:32:43.8739972Z compiled: bool, 2025-05-07T20:32:43.8740214Z ) -> None: 2025-05-07T20:32:43.8740436Z torch.manual_seed(2025) 2025-05-07T20:32:43.8740694Z 2025-05-07T20:32:43.8740982Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8741342Z 2025-05-07T20:32:43.8741559Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8741877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8742202Z x = x_sign * x_clamp 2025-05-07T20:32:43.8742459Z x0 = x[:, :D] 2025-05-07T20:32:43.8742691Z x1 = x[:, D:] 2025-05-07T20:32:43.8742909Z 2025-05-07T20:32:43.8743107Z if contiguous: 2025-05-07T20:32:43.8743640Z x0 = x0.contiguous() 2025-05-07T20:32:43.8743920Z x1 = x1.contiguous() 2025-05-07T20:32:43.8744183Z 2025-05-07T20:32:43.8744389Z if scale_ub is not None: 2025-05-07T20:32:43.8744676Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8745038Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8745371Z ) 2025-05-07T20:32:43.8745578Z else: 2025-05-07T20:32:43.8745798Z scale_ub_tensor = None 2025-05-07T20:32:43.8746066Z 2025-05-07T20:32:43.8746312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8746734Z op = silu_mul_quant 2025-05-07T20:32:43.8747005Z if compiled: 2025-05-07T20:32:43.8747266Z op = torch.compile(op) 2025-05-07T20:32:43.8747575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8747870Z 2025-05-07T20:32:43.8748155Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8748328Z 2025-05-07T20:32:43.8748439Z moe/activation_test.py:117: 2025-05-07T20:32:43.8748824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8749186Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8749514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8750107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8750695Z return fn(*args, **kwargs) 
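For context on the reference path: ref_fn in the listings above recomputes the SiLU-mul in fp32 and then calls triton_quantize_fp8_row. The following is a hedged, torch-only sketch of what a row-wise fp8 quantizer of that shape does; the 448.0 e4m3 maximum, the reciprocal-scale return value, the scale_ub-as-cap reading, and the helper name are assumptions for illustration, not FBGEMM's actual kernel semantics:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor]
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)        # per-row absolute max
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)        # cap inferred from the argument name
    scale = 448.0 / row_max                               # map row max to the fp8 e4m3 max
    y_fp8 = (y * scale[:, None]).to(torch.float8_e4m3fn)  # needs an fp8-capable torch build
    return y_fp8, scale.reciprocal()

Note that returning the reciprocal scale matches the dequant step the test performs, y_fp8.to(torch.float32) * y_scale[:, None].
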
2025-05-07T20:32:43.8751394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8752122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8752686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8753403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8754132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8754697Z kernel = self.compile( 2025-05-07T20:32:43.8755271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8755960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8756385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8756635Z 2025-05-07T20:32:43.8756855Z self = 2025-05-07T20:32:43.8758000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8759447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f9ee0>} 2025-05-07T20:32:43.8760983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8762056Z context = 2025-05-07T20:32:43.8762358Z 2025-05-07T20:32:43.8762539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8763090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8763581Z module_map=module_map) 2025-05-07T20:32:43.8763969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8764344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8764615Z E ^ 2025-05-07T20:32:43.8765155Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[... same test body as above; fails identically at `y_fp8, y_scale = fn()` while compiling
_fbgemm_silu_mul_quant: triton.compiler.errors.CompilationError / ValueError("type fp8e4nv
not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [... same test body as above, continuing past the fused call ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
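The example above fails differently from the rest: the fused fn() call returned, and it is the eager reference path that died, because triton_quantize_fp8_row itself JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) and hits the same architecture gate during autotuning. PyTorch itself can store and cast torch.float8_e4m3fn on this GPU (a plain dtype conversion needs no sm_89 hardware), so a Triton-free reference is feasible. The sketch below is a hypothetical stand-in for illustration, using the usual 448.0 e4m3 max-value convention and an assumed clamp semantics for scale_ub; it is not FBGEMM's triton_quantize_fp8_row:

from typing import Optional, Tuple
import torch

E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Rowwise fp8 quantization in plain PyTorch (no Triton compile step).

    Returns (xq, scale) such that xq.to(torch.float32) * scale[:, None] ~= x,
    matching how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]).
    """
    row_max = x.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Assumed semantics: scale_ub caps the per-row max used for scaling.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
    xq = (x.to(torch.float32) / scale[:, None]).clamp(-E4M3_MAX, E4M3_MAX)
    return xq.to(torch.float8_e4m3fn), scale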
[... the next eight Hypothesis examples all fail at `y_fp8, y_scale = fn()` with the identical
CompilationError while compiling _fbgemm_silu_mul_quant; their test bodies and tracebacks are
the same as the first listing above and are condensed here ...]

Trying example: test_silu_mul_quant(self=, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
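Every drawn example above dies in the same compile step regardless of T, D, scale_ub, contiguity, or torch.compile, so the Hypothesis sweep adds no new signal; the actionable fix is either at the job level (run the genai fp8 tests on an sm_89+ GPU) or at the test level (skip when the device cannot compile fp8e4nv). One way to express the latter, sketched as a suggestion rather than the repo's actual practice (the helper and marker names are hypothetical):

import pytest
import torch

def _can_compile_fp8e4nv() -> bool:
    # Same assumed capability gate as above: Triton's NVIDIA backend accepts
    # fp8e4nv only on compute capability >= (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test class (or individual tests), this turns dozens of
# identical CompilationError failures into a single clean skip on A10G-class
# runners such as linux.g5.4xlarge.
requires_fp8e4nv = pytest.mark.skipif(
    not _can_compile_fp8e4nv(),
    reason="Triton fp8e4nv needs compute capability >= 8.9; "
           "this GPU supports only fp8e4b15/fp8e5",
)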
Hypothesis then retries the test with further sampled parameter combinations. Every retry fails at the same point, with a traceback and CompilationError identical to the one above; only the example parameters differ:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)

The failures cover both compiled=True and compiled=False, so the error comes from compiling the Triton kernel itself, not from torch.compile. A reference sketch of the operation all of these examples exercise follows.
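Every retry above is compiling the same fused kernel. A plain-PyTorch sketch of the assumed semantics of silu_mul_quant (the name silu_mul_quant_ref and the rowwise-scaling details are assumptions for illustration, not FBGEMM's implementation):

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: y = silu(x0) * x1, quantized rowwise to fp8 e4m3
    # (Triton's fp8e4nv) with one scale per row; the Triton kernel fuses
    # all of this into a single pass.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        # Cap the scale, mirroring the scale_ub_tensor argument in the test.
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    y_scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

This runs in eager mode and is only meant to make the test's expected outputs (y_fp8, y_scale) concrete; it says nothing about the fused kernel's tiling or performance.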
2025-05-07T20:32:45.6593134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:45.6593858Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:45.6594422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.6595140Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.6595852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.6596414Z kernel = self.compile( 2025-05-07T20:32:45.6596979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.6597671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.6598091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.6598330Z 2025-05-07T20:32:45.6598552Z self = 2025-05-07T20:32:45.6599692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.6601288Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ade2c5e0>} 2025-05-07T20:32:45.6602689Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.6603752Z context = 2025-05-07T20:32:45.6604052Z 2025-05-07T20:32:45.6604229Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.6604780Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.6605315Z module_map=module_map) 2025-05-07T20:32:45.6605704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.6606072Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.6606347Z E ^ 2025-05-07T20:32:45.6606838Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:45.6607347Z 
2025-05-07T20:32:45.6607818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:45.6608360Z 
2025-05-07T20:32:45.6608469Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:45.6640782Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:45.6641690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:45.8247980Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:45.8281224Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:45.8282134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:45.8282777Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:45.8315125Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:45.8316163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.0006548Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:46.0046870Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.0047779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.0048438Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:46.0080589Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.0081496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.1235452Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:46.1267175Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.1268084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.1268727Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:46.1299487Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.1300394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.1301039Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:46.2951590Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.2952593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.2953241Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:46.2985353Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.2986267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.2986923Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:46.4306400Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.4307309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
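Every sampled configuration fails at the same point: Triton refuses to lower fp8e4nv (FP8 E4M3) while compiling _fbgemm_silu_mul_quant, because this Triton build only supports that dtype on GPUs with compute capability 8.9 or newer, and the A10G on the linux.g5.4xlarge runner reports 8.6, which is why the error lists fp8e4b15 and fp8e5 as the only available fp8 dtypes. A capability guard would turn these failures into skips on such runners; the sketch below is illustrative only, and the helper name cuda_supports_fp8e4nv plus the skip wiring are assumptions, not code from the FBGEMM test suite:

    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv requires compute capability >= 8.9 (e.g. L4, H100); the
        # A10G on this runner reports (8, 6), so compilation fails there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    class Fp8GuardExample(unittest.TestCase):
        @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
        def test_fp8_kernel(self) -> None:
            # A real fp8 test body (e.g. test_silu_mul_quant) would run here.
            pass

Because the decorator is evaluated once per test method, a Hypothesis-driven test would skip before drawing any examples, which also avoids paying kernel-compilation time on hardware that cannot run the kernel anyway.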
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.4306875Z 2025-05-07T20:32:46.4307309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.4307856Z 2025-05-07T20:32:46.4307964Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.4308399Z self=, 2025-05-07T20:32:46.4308818Z T=16384, 2025-05-07T20:32:46.4309023Z D=5120, 2025-05-07T20:32:46.4309230Z scale_ub=None, 2025-05-07T20:32:46.4309457Z contiguous=False, 2025-05-07T20:32:46.4309690Z compiled=False, 2025-05-07T20:32:46.4309908Z ) 2025-05-07T20:32:46.4310243Z self = 2025-05-07T20:32:46.4310759Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:46.4311058Z 2025-05-07T20:32:46.4311143Z @given( 2025-05-07T20:32:46.4311389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.4311718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.4312043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.4312388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.4312729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.4313028Z ) 2025-05-07T20:32:46.4313678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.4314142Z def test_silu_mul_quant( 2025-05-07T20:32:46.4314387Z self, 2025-05-07T20:32:46.4314594Z T: int, 2025-05-07T20:32:46.4314800Z D: int, 2025-05-07T20:32:46.4315025Z scale_ub: Optional[float], 2025-05-07T20:32:46.4315309Z contiguous: bool, 2025-05-07T20:32:46.4315560Z compiled: bool, 2025-05-07T20:32:46.4315787Z ) -> None: 2025-05-07T20:32:46.4316010Z torch.manual_seed(2025) 2025-05-07T20:32:46.4316262Z 2025-05-07T20:32:46.4316543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.4316970Z 2025-05-07T20:32:46.4317182Z x_sign = torch.sign(x) 2025-05-07T20:32:46.4317481Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.4319579Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
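Note on the recurring CompilationError: Triton generally lowers the fp8e4nv (float8_e4m3fn) type only on GPUs with compute capability 8.9 or newer; on older parts it exposes just fp8e4b15 and fp8e5, which is exactly what the ValueError above reports. A minimal guard sketch, assuming a capability cutoff of (8, 9) (consistent with the error, though this log never states the device's compute capability):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering generally requires sm_89+ (Ada/Hopper-class
    # GPUs); earlier architectures expose only fp8e4b15 and fp8e5.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

A hypothetical use would be decorating the test shown in these traces with unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"), so the fp8 path is skipped cleanly instead of failing at kernel compilation.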
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.4321671Z 2025-05-07T20:32:46.4321799Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:46.4322027Z 2025-05-07T20:32:46.4322198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.4322638Z self=, 2025-05-07T20:32:46.4323103Z T=4096, 2025-05-07T20:32:46.4323302Z D=7168, 2025-05-07T20:32:46.4323507Z scale_ub=1200.0, 2025-05-07T20:32:46.4323761Z contiguous=True, 2025-05-07T20:32:46.4323987Z compiled=True, 2025-05-07T20:32:46.4324203Z ) 2025-05-07T20:32:46.4324541Z self = 2025-05-07T20:32:46.4325057Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.4325337Z 2025-05-07T20:32:46.4325418Z @given( 2025-05-07T20:32:46.4325661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.4325990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.4326305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.4326650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.4326996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.4327288Z ) 2025-05-07T20:32:46.4327663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.4328123Z def test_silu_mul_quant( 2025-05-07T20:32:46.4328373Z self, 2025-05-07T20:32:46.4328572Z T: int, 2025-05-07T20:32:46.4328778Z D: int, 2025-05-07T20:32:46.4329010Z scale_ub: Optional[float], 2025-05-07T20:32:46.4329288Z contiguous: bool, 2025-05-07T20:32:46.4329539Z compiled: bool, 2025-05-07T20:32:46.4329771Z ) -> None: 2025-05-07T20:32:46.4329991Z torch.manual_seed(2025) 2025-05-07T20:32:46.4330245Z 2025-05-07T20:32:46.4330529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.4330879Z 2025-05-07T20:32:46.4331085Z x_sign = torch.sign(x) 2025-05-07T20:32:46.4331390Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.4333467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.4335398Z 2025-05-07T20:32:46.4335525Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:46.4335749Z 2025-05-07T20:32:46.4335857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.4336288Z self=, 2025-05-07T20:32:46.4336710Z T=16384, 2025-05-07T20:32:46.4336909Z D=7168, 2025-05-07T20:32:46.4337108Z scale_ub=None, 2025-05-07T20:32:46.4337336Z contiguous=False, 2025-05-07T20:32:46.4337566Z compiled=False, 2025-05-07T20:32:46.4337854Z ) 2025-05-07T20:32:46.4338191Z self = 2025-05-07T20:32:46.4338710Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:46.4339011Z 2025-05-07T20:32:46.4339091Z @given( 2025-05-07T20:32:46.4339337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.4339667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.4339981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.4340326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.4340714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.4341005Z ) 2025-05-07T20:32:46.4341370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.4341832Z def test_silu_mul_quant( 2025-05-07T20:32:46.4342157Z self, 2025-05-07T20:32:46.4342365Z T: int, 2025-05-07T20:32:46.4342575Z D: int, 2025-05-07T20:32:46.4342833Z scale_ub: Optional[float], 2025-05-07T20:32:46.4343121Z contiguous: bool, 2025-05-07T20:32:46.4343378Z compiled: bool, 2025-05-07T20:32:46.4343605Z ) -> None: 2025-05-07T20:32:46.4343832Z torch.manual_seed(2025) 2025-05-07T20:32:46.4344083Z 2025-05-07T20:32:46.4344365Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.4346484Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.4348425Z 2025-05-07T20:32:46.4348550Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.5584990Z 2025-05-07T20:32:46.5585364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5586081Z self=, 2025-05-07T20:32:46.5586667Z T=2048, 2025-05-07T20:32:46.5586936Z D=7168, 2025-05-07T20:32:46.5587163Z scale_ub=1200.0, 2025-05-07T20:32:46.5587404Z contiguous=True, 2025-05-07T20:32:46.5587635Z compiled=True, 2025-05-07T20:32:46.5587852Z ) 2025-05-07T20:32:46.5588205Z self = 2025-05-07T20:32:46.5588725Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.5589016Z 2025-05-07T20:32:46.5589099Z @given( 2025-05-07T20:32:46.5589344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5589683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5590017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5590372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5590723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5591020Z ) 2025-05-07T20:32:46.5591392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5591861Z def test_silu_mul_quant( 2025-05-07T20:32:46.5592113Z self, 2025-05-07T20:32:46.5592324Z T: int, 2025-05-07T20:32:46.5592536Z D: int, 2025-05-07T20:32:46.5592770Z scale_ub: Optional[float], 2025-05-07T20:32:46.5593060Z contiguous: bool, 2025-05-07T20:32:46.5593318Z compiled: bool, 2025-05-07T20:32:46.5593560Z ) -> None: 2025-05-07T20:32:46.5593788Z torch.manual_seed(2025) 2025-05-07T20:32:46.5594048Z 2025-05-07T20:32:46.5594339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5594702Z 2025-05-07T20:32:46.5595190Z x_sign = torch.sign(x) 2025-05-07T20:32:46.5595512Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.5597594Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.5599674Z 2025-05-07T20:32:46.5599799Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:46.5600030Z 2025-05-07T20:32:46.5600271Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5600794Z self=, 2025-05-07T20:32:46.5601285Z T=2048, 2025-05-07T20:32:46.5601485Z D=7168, 2025-05-07T20:32:46.5601692Z scale_ub=None, 2025-05-07T20:32:46.5601928Z contiguous=True, 2025-05-07T20:32:46.5602161Z compiled=False, 2025-05-07T20:32:46.5602381Z ) 2025-05-07T20:32:46.5602719Z self = 2025-05-07T20:32:46.5603234Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.5603521Z 2025-05-07T20:32:46.5603604Z @given( 2025-05-07T20:32:46.5603884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5604215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5604543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5604887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5605232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5605536Z ) 2025-05-07T20:32:46.5605908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5606364Z def test_silu_mul_quant( 2025-05-07T20:32:46.5606620Z self, 2025-05-07T20:32:46.5606823Z T: int, 2025-05-07T20:32:46.5607025Z D: int, 2025-05-07T20:32:46.5607256Z scale_ub: Optional[float], 2025-05-07T20:32:46.5607545Z contiguous: bool, 2025-05-07T20:32:46.5607792Z compiled: bool, 2025-05-07T20:32:46.5608024Z ) -> None: 2025-05-07T20:32:46.5608251Z torch.manual_seed(2025) 2025-05-07T20:32:46.5608499Z 2025-05-07T20:32:46.5608793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5609151Z 2025-05-07T20:32:46.5609350Z > x_sign = torch.sign(x) 2025-05-07T20:32:46.5611360Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
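Note on the interleaved OutOfMemoryError entries: the free-memory figures shrink as Hypothesis retries examples, which suggests allocations surviving from earlier failed examples rather than any single oversized tensor; the largest request in these traces is 448 MiB against a 22 GiB device. Two mitigations, sketched under the assumption that the test process can be configured before CUDA initializes:

import os
# The allocator hint from the error message; it only takes effect if set
# before the process makes its first CUDA allocation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc
import torch

def release_cuda_memory() -> None:
    # Hypothetical per-example teardown: drop dead Python references, then
    # return cached blocks to the driver so the next example starts clean.
    gc.collect()
    torch.cuda.empty_cache()

Calling release_cuda_memory() between examples (for instance from a tearDown hook) trades some allocator reuse for headroom; it does not address the fp8 compilation failures above.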
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.5613288Z 2025-05-07T20:32:46.5613710Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:46.5613942Z 2025-05-07T20:32:46.5614051Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5614488Z self=, 2025-05-07T20:32:46.5614906Z T=1, 2025-05-07T20:32:46.5615100Z D=7168, 2025-05-07T20:32:46.5615305Z scale_ub=1200.0, 2025-05-07T20:32:46.5615534Z contiguous=True, 2025-05-07T20:32:46.5615771Z compiled=False, 2025-05-07T20:32:46.5615991Z ) 2025-05-07T20:32:46.5616400Z self = 2025-05-07T20:32:46.5616917Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.5617205Z 2025-05-07T20:32:46.5617286Z @given( 2025-05-07T20:32:46.5617530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5617854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5618183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5618533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5618872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5619237Z ) 2025-05-07T20:32:46.5619606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5620062Z def test_silu_mul_quant( 2025-05-07T20:32:46.5620317Z self, 2025-05-07T20:32:46.5620536Z T: int, 2025-05-07T20:32:46.5620744Z D: int, 2025-05-07T20:32:46.5621036Z scale_ub: Optional[float], 2025-05-07T20:32:46.5621327Z contiguous: bool, 2025-05-07T20:32:46.5621637Z compiled: bool, 2025-05-07T20:32:46.5621947Z ) -> None: 2025-05-07T20:32:46.5629967Z torch.manual_seed(2025) 2025-05-07T20:32:46.5630270Z 2025-05-07T20:32:46.5630561Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5630922Z 2025-05-07T20:32:46.5631129Z x_sign = torch.sign(x) 2025-05-07T20:32:46.5631432Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.5631761Z x = x_sign * x_clamp 2025-05-07T20:32:46.5632016Z x0 = x[:, :D] 2025-05-07T20:32:46.5632244Z x1 = x[:, D:] 2025-05-07T20:32:46.5632462Z 2025-05-07T20:32:46.5632659Z if contiguous: 2025-05-07T20:32:46.5632894Z x0 = x0.contiguous() 2025-05-07T20:32:46.5633166Z x1 = x1.contiguous() 2025-05-07T20:32:46.5633421Z 2025-05-07T20:32:46.5633617Z if scale_ub is not None: 2025-05-07T20:32:46.5633908Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.5634268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.5634587Z ) 2025-05-07T20:32:46.5634796Z else: 2025-05-07T20:32:46.5635017Z scale_ub_tensor = None 2025-05-07T20:32:46.5635283Z 2025-05-07T20:32:46.5635523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.5635859Z op = silu_mul_quant 2025-05-07T20:32:46.5636123Z if compiled: 2025-05-07T20:32:46.5636379Z op = torch.compile(op) 2025-05-07T20:32:46.5636692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5636987Z 2025-05-07T20:32:46.5637185Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.5637366Z 2025-05-07T20:32:46.5637470Z moe/activation_test.py:117: 2025-05-07T20:32:46.5637783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5638131Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.5638434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5639164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.5639889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.5640541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.5641260Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.5641954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.5642504Z kernel = self.compile( 2025-05-07T20:32:46.5643071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.5643759Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.5644264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5644504Z 2025-05-07T20:32:46.5644723Z self = 2025-05-07T20:32:46.5645848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.5647285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb462a0>} 2025-05-07T20:32:46.5648718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.5649782Z context = 2025-05-07T20:32:46.5650122Z 2025-05-07T20:32:46.5650340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.5650892Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.5651387Z module_map=module_map) 2025-05-07T20:32:46.5651759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.5652134Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.5652404Z E ^ 2025-05-07T20:32:46.5652888Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.5653358Z 2025-05-07T20:32:46.5653789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.5654327Z 2025-05-07T20:32:46.5654439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5654877Z self=, 2025-05-07T20:32:46.5655298Z T=128, 2025-05-07T20:32:46.5655491Z D=5120, 2025-05-07T20:32:46.5655693Z scale_ub=None, 2025-05-07T20:32:46.5655916Z contiguous=True, 2025-05-07T20:32:46.5656141Z compiled=False, 2025-05-07T20:32:46.5656358Z ) 2025-05-07T20:32:46.5656692Z self = 2025-05-07T20:32:46.5657198Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.5657484Z 2025-05-07T20:32:46.5657564Z @given( 2025-05-07T20:32:46.5657801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5658122Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5658442Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5658784Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5659130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5659429Z ) 2025-05-07T20:32:46.5659796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5660260Z def test_silu_mul_quant( 2025-05-07T20:32:46.5660505Z self, 2025-05-07T20:32:46.5660711Z T: int, 2025-05-07T20:32:46.5660918Z D: int, 2025-05-07T20:32:46.5661140Z scale_ub: Optional[float], 2025-05-07T20:32:46.5661425Z contiguous: bool, 2025-05-07T20:32:46.5661677Z compiled: bool, 2025-05-07T20:32:46.5661903Z ) -> None: 2025-05-07T20:32:46.5662127Z torch.manual_seed(2025) 2025-05-07T20:32:46.5662380Z 2025-05-07T20:32:46.5662658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5663020Z 2025-05-07T20:32:46.5663227Z x_sign = torch.sign(x) 2025-05-07T20:32:46.5663527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.5663850Z x = x_sign * x_clamp 2025-05-07T20:32:46.5664102Z x0 = x[:, :D] 2025-05-07T20:32:46.5664334Z x1 = x[:, D:] 2025-05-07T20:32:46.5664543Z 2025-05-07T20:32:46.5664792Z if contiguous: 2025-05-07T20:32:46.5665038Z x0 = x0.contiguous() 2025-05-07T20:32:46.5665302Z x1 = x1.contiguous() 2025-05-07T20:32:46.5665550Z 2025-05-07T20:32:46.5665754Z if scale_ub is not None: 2025-05-07T20:32:46.5666032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.5666387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.5666709Z ) 2025-05-07T20:32:46.5666904Z else: 2025-05-07T20:32:46.5667125Z scale_ub_tensor = None 2025-05-07T20:32:46.5667440Z 2025-05-07T20:32:46.5667676Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.5668007Z op = silu_mul_quant 2025-05-07T20:32:46.5668267Z if compiled: 2025-05-07T20:32:46.5668514Z op = torch.compile(op) 2025-05-07T20:32:46.5668824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5669149Z 2025-05-07T20:32:46.5669346Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.5669521Z 2025-05-07T20:32:46.5669662Z moe/activation_test.py:117: 2025-05-07T20:32:46.5669972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5670320Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.5670605Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5671317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.5672028Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.5672582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.5673291Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.5673979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.5674540Z kernel = self.compile( 2025-05-07T20:32:46.5675097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.5675779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.5676194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5676429Z 2025-05-07T20:32:46.5676647Z self = 2025-05-07T20:32:46.5677763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.5679191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb471a0>} 2025-05-07T20:32:46.5680671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.5681738Z context = 2025-05-07T20:32:46.5682036Z 2025-05-07T20:32:46.5682209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.5682753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.5683241Z module_map=module_map) 2025-05-07T20:32:46.5683622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.5683983Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.5684256Z E ^ 2025-05-07T20:32:46.5684736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.5685207Z 2025-05-07T20:32:46.5685697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.6810359Z 2025-05-07T20:32:46.6810628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6811310Z self=, 2025-05-07T20:32:46.6811901Z T=128, 2025-05-07T20:32:46.6812129Z D=7168, 2025-05-07T20:32:46.6812339Z scale_ub=None, 2025-05-07T20:32:46.6812569Z contiguous=True, 2025-05-07T20:32:46.6812803Z compiled=False, 2025-05-07T20:32:46.6813026Z ) 2025-05-07T20:32:46.6813777Z self = 2025-05-07T20:32:46.6814290Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.6814580Z 2025-05-07T20:32:46.6814667Z @given( 2025-05-07T20:32:46.6814921Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6815345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6815752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6816111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6816462Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6816762Z ) 2025-05-07T20:32:46.6817132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6817626Z def test_silu_mul_quant( 2025-05-07T20:32:46.6817881Z self, 2025-05-07T20:32:46.6818088Z T: int, 2025-05-07T20:32:46.6818292Z D: int, 2025-05-07T20:32:46.6818526Z scale_ub: Optional[float], 2025-05-07T20:32:46.6818845Z contiguous: bool, 2025-05-07T20:32:46.6819123Z compiled: bool, 2025-05-07T20:32:46.6819363Z ) -> None: 2025-05-07T20:32:46.6819593Z torch.manual_seed(2025) 2025-05-07T20:32:46.6819843Z 2025-05-07T20:32:46.6820137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6820503Z 2025-05-07T20:32:46.6820716Z x_sign = torch.sign(x) 2025-05-07T20:32:46.6821030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.6821374Z x = x_sign * x_clamp 2025-05-07T20:32:46.6821631Z x0 = x[:, :D] 2025-05-07T20:32:46.6821864Z x1 = x[:, D:] 2025-05-07T20:32:46.6822082Z 2025-05-07T20:32:46.6822284Z if contiguous: 2025-05-07T20:32:46.6822534Z x0 = x0.contiguous() 2025-05-07T20:32:46.6822807Z x1 = x1.contiguous() 2025-05-07T20:32:46.6823065Z 2025-05-07T20:32:46.6823272Z if scale_ub is not None: 2025-05-07T20:32:46.6823563Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.6823922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.6824253Z ) 2025-05-07T20:32:46.6824458Z else: 2025-05-07T20:32:46.6824686Z scale_ub_tensor = None 2025-05-07T20:32:46.6824961Z 2025-05-07T20:32:46.6825205Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6825543Z op = silu_mul_quant 2025-05-07T20:32:46.6825814Z if compiled: 2025-05-07T20:32:46.6826080Z op = torch.compile(op) 2025-05-07T20:32:46.6826391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6826684Z 2025-05-07T20:32:46.6826890Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.6827063Z 2025-05-07T20:32:46.6827169Z moe/activation_test.py:117: 2025-05-07T20:32:46.6827484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6827843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.6828136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6828859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.6829631Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.6830284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.6830998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.6831697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.6832256Z kernel = self.compile( 2025-05-07T20:32:46.6832819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.6833507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.6834022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6834264Z 2025-05-07T20:32:46.6834487Z self = 2025-05-07T20:32:46.6835690Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.6837176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca58040>} 2025-05-07T20:32:46.6838572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.6839636Z context = 2025-05-07T20:32:46.6839942Z 2025-05-07T20:32:46.6840191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.6840739Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.6841232Z module_map=module_map) 2025-05-07T20:32:46.6841629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.6842003Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.6842280Z E ^ 2025-05-07T20:32:46.6842769Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6843239Z 2025-05-07T20:32:46.6843682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.6844218Z 2025-05-07T20:32:46.6844331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6844776Z self=, 2025-05-07T20:32:46.6845202Z T=2048, 2025-05-07T20:32:46.6845398Z D=7168, 2025-05-07T20:32:46.6845607Z scale_ub=1200.0, 2025-05-07T20:32:46.6845850Z contiguous=True, 2025-05-07T20:32:46.6846086Z compiled=False, 2025-05-07T20:32:46.6846314Z ) 2025-05-07T20:32:46.6846658Z self = 2025-05-07T20:32:46.6847191Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.6847476Z 2025-05-07T20:32:46.6847557Z @given( 2025-05-07T20:32:46.6847809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6848143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6848467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6848822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6849184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6849493Z ) 2025-05-07T20:32:46.6849868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6850339Z def test_silu_mul_quant( 2025-05-07T20:32:46.6850596Z self, 2025-05-07T20:32:46.6850797Z T: int, 2025-05-07T20:32:46.6851010Z D: int, 2025-05-07T20:32:46.6851245Z scale_ub: Optional[float], 2025-05-07T20:32:46.6851576Z contiguous: bool, 2025-05-07T20:32:46.6851843Z compiled: bool, 2025-05-07T20:32:46.6852080Z ) -> None: 2025-05-07T20:32:46.6852306Z torch.manual_seed(2025) 2025-05-07T20:32:46.6852564Z 2025-05-07T20:32:46.6852856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6855001Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.6857004Z 2025-05-07T20:32:46.6857131Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.6857363Z 2025-05-07T20:32:46.6857510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6857953Z self=, 2025-05-07T20:32:46.6858376Z T=1, 2025-05-07T20:32:46.6858568Z D=5120, 2025-05-07T20:32:46.6858775Z scale_ub=1200.0, 2025-05-07T20:32:46.6859053Z contiguous=True, 2025-05-07T20:32:46.6859289Z compiled=False, 2025-05-07T20:32:46.6859508Z ) 2025-05-07T20:32:46.6859843Z self = 2025-05-07T20:32:46.6860354Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.6860636Z 2025-05-07T20:32:46.6860717Z @given( 2025-05-07T20:32:46.6860966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6861305Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6861625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6861988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6862342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6862641Z ) 2025-05-07T20:32:46.6863012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6863488Z def test_silu_mul_quant( 2025-05-07T20:32:46.6863742Z self, 2025-05-07T20:32:46.6863953Z T: int, 2025-05-07T20:32:46.6864172Z D: int, 2025-05-07T20:32:46.6864401Z scale_ub: Optional[float], 2025-05-07T20:32:46.6864696Z contiguous: bool, 2025-05-07T20:32:46.6864963Z compiled: bool, 2025-05-07T20:32:46.6865195Z ) -> None: 2025-05-07T20:32:46.6865425Z torch.manual_seed(2025) 2025-05-07T20:32:46.6865689Z 2025-05-07T20:32:46.6865972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6866337Z 2025-05-07T20:32:46.6866550Z x_sign = torch.sign(x) 2025-05-07T20:32:46.6866865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.6867195Z x = x_sign * x_clamp 2025-05-07T20:32:46.6867462Z x0 = x[:, :D] 2025-05-07T20:32:46.6867694Z x1 = x[:, D:] 2025-05-07T20:32:46.6867910Z 2025-05-07T20:32:46.6868109Z if contiguous: 2025-05-07T20:32:46.6868401Z x0 = x0.contiguous() 2025-05-07T20:32:46.6868735Z x1 = x1.contiguous() 2025-05-07T20:32:46.6869057Z 2025-05-07T20:32:46.6869315Z if scale_ub is not None: 2025-05-07T20:32:46.6869673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.6870119Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.6870501Z ) 2025-05-07T20:32:46.6870705Z else: 2025-05-07T20:32:46.6870929Z scale_ub_tensor = None 2025-05-07T20:32:46.6871197Z 2025-05-07T20:32:46.6871439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6871777Z op = silu_mul_quant 2025-05-07T20:32:46.6872046Z if compiled: 2025-05-07T20:32:46.6872362Z op = torch.compile(op) 2025-05-07T20:32:46.6872681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6872977Z 2025-05-07T20:32:46.6873186Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.6873362Z 2025-05-07T20:32:46.6873468Z moe/activation_test.py:117: 2025-05-07T20:32:46.6873784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6874142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.6874440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6875266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.6875990Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.6876565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.6877325Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.6878069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.6878636Z kernel = self.compile( 2025-05-07T20:32:46.6879207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.6879902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.6880482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6880727Z 2025-05-07T20:32:46.6880955Z self = 2025-05-07T20:32:46.6882082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.6883528Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca59580>} 2025-05-07T20:32:46.6884934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.6886007Z context = 2025-05-07T20:32:46.6886313Z 2025-05-07T20:32:46.6886497Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.6887052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.6887552Z module_map=module_map) 2025-05-07T20:32:46.6887944Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.6888322Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.6888604Z E ^ 2025-05-07T20:32:46.6889110Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6889583Z 2025-05-07T20:32:46.6890026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7712183Z 2025-05-07T20:32:46.7712496Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7713211Z self=, 2025-05-07T20:32:46.7714136Z T=2048, 2025-05-07T20:32:46.7714440Z D=5120, 2025-05-07T20:32:46.7714656Z scale_ub=None, 2025-05-07T20:32:46.7714878Z contiguous=True, 2025-05-07T20:32:46.7715116Z compiled=False, 2025-05-07T20:32:46.7715336Z ) 2025-05-07T20:32:46.7715667Z self = 2025-05-07T20:32:46.7716193Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.7716721Z 2025-05-07T20:32:46.7716809Z @given( 2025-05-07T20:32:46.7717059Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7717387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7717715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7718067Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7718413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7718721Z ) 2025-05-07T20:32:46.7719089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7719635Z def test_silu_mul_quant( 2025-05-07T20:32:46.7719885Z self, 2025-05-07T20:32:46.7720232Z T: int, 2025-05-07T20:32:46.7720447Z D: int, 2025-05-07T20:32:46.7720674Z scale_ub: Optional[float], 2025-05-07T20:32:46.7720962Z contiguous: bool, 2025-05-07T20:32:46.7721309Z compiled: bool, 2025-05-07T20:32:46.7721543Z ) -> None: 2025-05-07T20:32:46.7721776Z torch.manual_seed(2025) 2025-05-07T20:32:46.7722095Z 2025-05-07T20:32:46.7722397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7722760Z 2025-05-07T20:32:46.7722970Z > x_sign = torch.sign(x) 2025-05-07T20:32:46.7724999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7726950Z 2025-05-07T20:32:46.7727081Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:46.7727314Z 2025-05-07T20:32:46.7727427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7727986Z self=, 2025-05-07T20:32:46.7728499Z T=16384, 2025-05-07T20:32:46.7728760Z D=5120, 2025-05-07T20:32:46.7729236Z scale_ub=None, 2025-05-07T20:32:46.7737100Z contiguous=True, 2025-05-07T20:32:46.7737351Z compiled=False, 2025-05-07T20:32:46.7737576Z ) 2025-05-07T20:32:46.7737917Z self = 2025-05-07T20:32:46.7738461Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.7738801Z 2025-05-07T20:32:46.7738903Z @given( 2025-05-07T20:32:46.7739153Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7739483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7739817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7740174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7740525Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7740834Z ) 2025-05-07T20:32:46.7741213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7741686Z def test_silu_mul_quant( 2025-05-07T20:32:46.7741943Z self, 2025-05-07T20:32:46.7742156Z T: int, 2025-05-07T20:32:46.7742369Z D: int, 2025-05-07T20:32:46.7742600Z scale_ub: Optional[float], 2025-05-07T20:32:46.7742894Z contiguous: bool, 2025-05-07T20:32:46.7743158Z compiled: bool, 2025-05-07T20:32:46.7743396Z ) -> None: 2025-05-07T20:32:46.7743629Z torch.manual_seed(2025) 2025-05-07T20:32:46.7743890Z 2025-05-07T20:32:46.7744175Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7746417Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7748362Z 2025-05-07T20:32:46.7748488Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7748717Z 2025-05-07T20:32:46.7748829Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7749313Z self=, 2025-05-07T20:32:46.7749733Z T=4096, 2025-05-07T20:32:46.7749933Z D=5120, 2025-05-07T20:32:46.7750138Z scale_ub=None, 2025-05-07T20:32:46.7750362Z contiguous=True, 2025-05-07T20:32:46.7750603Z compiled=False, 2025-05-07T20:32:46.7750865Z ) 2025-05-07T20:32:46.7751196Z self = 2025-05-07T20:32:46.7751758Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.7752042Z 2025-05-07T20:32:46.7752131Z @given( 2025-05-07T20:32:46.7752369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7752704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7753030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7753379Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7753722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7754030Z ) 2025-05-07T20:32:46.7754399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7754861Z def test_silu_mul_quant( 2025-05-07T20:32:46.7755120Z self, 2025-05-07T20:32:46.7755332Z T: int, 2025-05-07T20:32:46.7755541Z D: int, 2025-05-07T20:32:46.7755776Z scale_ub: Optional[float], 2025-05-07T20:32:46.7756070Z contiguous: bool, 2025-05-07T20:32:46.7756326Z compiled: bool, 2025-05-07T20:32:46.7756562Z ) -> None: 2025-05-07T20:32:46.7756792Z torch.manual_seed(2025) 2025-05-07T20:32:46.7757045Z 2025-05-07T20:32:46.7757340Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7759460Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7761478Z 2025-05-07T20:32:46.7761609Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7761836Z 2025-05-07T20:32:46.7761954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7762385Z self=, 2025-05-07T20:32:46.7762814Z T=2048, 2025-05-07T20:32:46.7763014Z D=5120, 2025-05-07T20:32:46.7763213Z scale_ub=None, 2025-05-07T20:32:46.7763443Z contiguous=False, 2025-05-07T20:32:46.7763687Z compiled=False, 2025-05-07T20:32:46.7763899Z ) 2025-05-07T20:32:46.7764235Z self = 2025-05-07T20:32:46.7764761Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:46.7765049Z 2025-05-07T20:32:46.7765138Z @given( 2025-05-07T20:32:46.7765378Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7765709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7766038Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7766431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7766782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7767088Z ) 2025-05-07T20:32:46.7767451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7767922Z def test_silu_mul_quant( 2025-05-07T20:32:46.7768181Z self, 2025-05-07T20:32:46.7768392Z T: int, 2025-05-07T20:32:46.7768596Z D: int, 2025-05-07T20:32:46.7768834Z scale_ub: Optional[float], 2025-05-07T20:32:46.7769130Z contiguous: bool, 2025-05-07T20:32:46.7769423Z compiled: bool, 2025-05-07T20:32:46.7769659Z ) -> None: 2025-05-07T20:32:46.7769893Z torch.manual_seed(2025) 2025-05-07T20:32:46.7770145Z 2025-05-07T20:32:46.7770433Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7772630Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7774545Z 2025-05-07T20:32:46.7774676Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7774900Z 2025-05-07T20:32:46.7775017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7775447Z self=, 2025-05-07T20:32:46.7775868Z T=4096, 2025-05-07T20:32:46.7776074Z D=7168, 2025-05-07T20:32:46.7776273Z scale_ub=None, 2025-05-07T20:32:46.7776505Z contiguous=True, 2025-05-07T20:32:46.7776745Z compiled=True, 2025-05-07T20:32:46.7776962Z ) 2025-05-07T20:32:46.7777302Z self = 2025-05-07T20:32:46.7777827Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7778107Z 2025-05-07T20:32:46.7778188Z @given( 2025-05-07T20:32:46.7778437Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7778794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7779151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7779496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7779858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7780163Z ) 2025-05-07T20:32:46.7780527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7780995Z def test_silu_mul_quant( 2025-05-07T20:32:46.7781256Z self, 2025-05-07T20:32:46.7781459Z T: int, 2025-05-07T20:32:46.7781674Z D: int, 2025-05-07T20:32:46.7781911Z scale_ub: Optional[float], 2025-05-07T20:32:46.7782197Z contiguous: bool, 2025-05-07T20:32:46.7782456Z compiled: bool, 2025-05-07T20:32:46.7782696Z ) -> None: 2025-05-07T20:32:46.7782923Z torch.manual_seed(2025) 2025-05-07T20:32:46.7783184Z 2025-05-07T20:32:46.7783476Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7785603Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7787573Z 2025-05-07T20:32:46.7787709Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7787932Z 2025-05-07T20:32:46.7788041Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7788476Z self=, 2025-05-07T20:32:46.7788923Z T=2048, 2025-05-07T20:32:46.7789139Z D=5120, 2025-05-07T20:32:46.7789345Z scale_ub=1200.0, 2025-05-07T20:32:46.7789586Z contiguous=False, 2025-05-07T20:32:46.7789830Z compiled=False, 2025-05-07T20:32:46.8331976Z ) 2025-05-07T20:32:46.8332739Z self = 2025-05-07T20:32:46.8333479Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:46.8333879Z 2025-05-07T20:32:46.8333993Z @given( 2025-05-07T20:32:46.8334300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.8334845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.8335339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.8335688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.8336044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.8336353Z ) 2025-05-07T20:32:46.8336721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.8337194Z def test_silu_mul_quant( 2025-05-07T20:32:46.8337461Z self, 2025-05-07T20:32:46.8337673Z T: int, 2025-05-07T20:32:46.8337880Z D: int, 2025-05-07T20:32:46.8338121Z scale_ub: Optional[float], 2025-05-07T20:32:46.8338413Z contiguous: bool, 2025-05-07T20:32:46.8338679Z compiled: bool, 2025-05-07T20:32:46.8338958Z ) -> None: 2025-05-07T20:32:46.8339189Z torch.manual_seed(2025) 2025-05-07T20:32:46.8339451Z 2025-05-07T20:32:46.8339741Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.8341871Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.8343802Z 2025-05-07T20:32:46.8343928Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.8344156Z 2025-05-07T20:32:46.8344266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.8344708Z self=, 2025-05-07T20:32:46.8345134Z T=4096, 2025-05-07T20:32:46.8345333Z D=7168, 2025-05-07T20:32:46.8345543Z scale_ub=1200.0, 2025-05-07T20:32:46.8345785Z contiguous=True, 2025-05-07T20:32:46.8346086Z compiled=False, 2025-05-07T20:32:46.8346401Z ) 2025-05-07T20:32:46.8346860Z self = 2025-05-07T20:32:46.8347488Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.8347780Z 2025-05-07T20:32:46.8347861Z @given( 2025-05-07T20:32:46.8348105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.8348431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.8348756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.8349106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.8349454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.8349749Z ) 2025-05-07T20:32:46.8350114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.8350579Z def test_silu_mul_quant( 2025-05-07T20:32:46.8350831Z self, 2025-05-07T20:32:46.8351133Z T: int, 2025-05-07T20:32:46.8351350Z D: int, 2025-05-07T20:32:46.8351580Z scale_ub: Optional[float], 2025-05-07T20:32:46.8351873Z contiguous: bool, 2025-05-07T20:32:46.8352126Z compiled: bool, 2025-05-07T20:32:46.8352356Z ) -> None: 2025-05-07T20:32:46.8352585Z torch.manual_seed(2025) 2025-05-07T20:32:46.8352844Z 2025-05-07T20:32:46.8353139Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.8355305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests ...>,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <ActivationTests ...>
T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Three more examples fail identically at moe/activation_test.py:92 (the torch.randn call), differing only in the size requested:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (26.44 MiB free of 22.07 GiB)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (26.44 MiB free of 22.07 GiB)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (26.44 MiB free of 22.07 GiB)

moe/activation_test.py:92: OutOfMemoryError
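Each of these messages repeats the same allocator hint, which can be applied without code changes. A minimal sketch of enabling it from Python, assuming it runs in a fresh process before any CUDA context exists (the variable is read when the caching allocator initializes, so it must be set before the first allocation):

# Sketch: enable expandable segments before torch touches the GPU.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the variable so the allocator picks it up

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()  # free/total bytes on the current device
    print(f"free={free_b / 2**20:.2f} MiB of {total_b / 2**30:.2f} GiB")

Exporting the same variable in the workflow environment before pytest starts achieves the same effect without touching the test code.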
Trying example: test_silu_mul_quant(
    self=<ActivationTests ...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests ...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ASTSource ...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f07ac7b11c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
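The ValueError is Triton rejecting fp8e4nv (FP8 E4M3) on this GPU. As a sketch, the condition could be detected up front and used to skip the FP8 paths instead of failing inside kernel compilation; the (8, 9) cutoff is an assumption inferred from this log (an SM 8.6 A10G rejecting fp8e4nv), so verify it against the Triton release in use:

# Sketch: gate FP8-E4M3 (fp8e4nv) test paths on compute capability.
# Assumption: fp8e4nv needs SM >= 8.9 (Ada/Hopper-class parts).
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A test module could consult this in setUp and call self.skipTest(...) rather than letting the Triton compile raise.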
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (26.44 MiB free of 22.07 GiB)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Here fn() runs with compiled=True, so the call passes through torch._dynamo (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn) before reaching silu_mul_quant at activation.py:80, where the same kernel compilation fails:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
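A small sketch of the pattern fn() uses, for context: torch.compile only wraps the callable, and the underlying Triton kernel is still compiled on first call, so hardware-support errors surface when op(...) runs, not at wrap time.

# Sketch of the test's fn() pattern; op stands in for silu_mul_quant.
import torch

def run_op(op, x0, x1, scale_ub_tensor, compiled: bool):
    if compiled:
        op = torch.compile(op)  # lazy: nothing is compiled yet
    return op(x0, x1, scale_ub_tensor)  # kernel compile + launch happen here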
Memory pressure now dominates even the small examples; three more fail while building their inputs, with only 4.44 MiB free:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
moe/activation_test.py:92: OutOfMemoryError
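The requested sizes line up with the input shape: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes. A quick sketch of that arithmetic for the shapes above (the smallest shapes are subject to allocator rounding, so they match less exactly):

# Sketch: expected size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16).
# bfloat16 is 2 bytes per element.
def x_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20

print(x_mib(16384, 7168))  # 448.0 MiB, as reported above
print(x_mib(4096, 7168))   # 112.0 MiB
print(x_mib(2048, 7168))   # 56.0 MiB
print(x_mib(2048, 5120))   # 40.0 MiB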
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |        ~~~~~~^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests ...>,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests ...>,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests ...>,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |                        ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=<ActivationTests ...>,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
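The reproduce_failure lines above are the quickest way to replay a single failure locally. A sketch of applying the first one (version string and payload copied from sub-exception 1; Hypothesis requires the decorator to sit on the same test with its original strategies, and it should be removed again after debugging):

# Sketch: pin Hypothesis to falsifying example 1. The decorator stacks on
# top of the existing @given/@settings on test_silu_mul_quant.
from hypothesis import reproduce_failure

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
# @given(...) and @settings(...) unchanged from the test above
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...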
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=<ActivationTests ...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
For this example fn() itself returns, and the failure moves to the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(the intermediate jit/autotuner frames are identical to sub-exception 4 above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: the same ValueError, raised while compiling _fbgemm_silu_mul_quant
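ref_fn above delegates the row-wise quantization to triton_quantize_fp8_row, which is exactly where the unsupported-dtype error fires. For intuition about what that step computes, a hedged pure-PyTorch sketch (not FBGEMM's implementation; the E4M3 max of 448 and the epsilon are assumptions of this sketch, and torch.float8_e4m3fn needs a reasonably recent PyTorch):

import torch

FP8_MAX = 448.0  # finite max of float8_e4m3fn (assumption for this sketch)

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row absolute max, optionally capped by scale_ub as in the test.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # One scale per row; the epsilon guards all-zero rows.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] recovers y approximately, which is the check the test performs.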
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3296591Z 2025-05-07T20:32:47.3297033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3297571Z 2025-05-07T20:32:47.3297687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3298124Z self=, 2025-05-07T20:32:47.3298558Z T=2048, 2025-05-07T20:32:47.3298763Z D=5120, 2025-05-07T20:32:47.3298974Z scale_ub=1200.0, 2025-05-07T20:32:47.3299217Z contiguous=True, 2025-05-07T20:32:47.3299464Z compiled=True, 2025-05-07T20:32:47.3299680Z ) 2025-05-07T20:32:47.3300019Z self = 2025-05-07T20:32:47.3300547Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3300833Z 2025-05-07T20:32:47.3300922Z @given( 2025-05-07T20:32:47.3301164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3301501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3301832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3302184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3302538Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3302844Z ) 2025-05-07T20:32:47.3303211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3303680Z def test_silu_mul_quant( 2025-05-07T20:32:47.3303944Z self, 2025-05-07T20:32:47.3304154Z T: int, 2025-05-07T20:32:47.3304365Z D: int, 2025-05-07T20:32:47.3304605Z scale_ub: Optional[float], 2025-05-07T20:32:47.3304892Z contiguous: bool, 2025-05-07T20:32:47.3305153Z compiled: bool, 2025-05-07T20:32:47.3305397Z ) -> None: 2025-05-07T20:32:47.3305624Z torch.manual_seed(2025) 2025-05-07T20:32:47.3305885Z 2025-05-07T20:32:47.3306180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3306542Z 2025-05-07T20:32:47.3306746Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3307065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3307396Z x = x_sign * x_clamp 2025-05-07T20:32:47.3307649Z x0 = x[:, :D] 2025-05-07T20:32:47.3307883Z x1 = x[:, D:] 2025-05-07T20:32:47.3308107Z 2025-05-07T20:32:47.3308303Z if contiguous: 2025-05-07T20:32:47.3308552Z x0 = x0.contiguous() 2025-05-07T20:32:47.3308832Z x1 = x1.contiguous() 2025-05-07T20:32:47.3309132Z 2025-05-07T20:32:47.3309348Z if scale_ub is not None: 2025-05-07T20:32:47.3309647Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3310003Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3310335Z ) 2025-05-07T20:32:47.3310545Z else: 2025-05-07T20:32:47.3310767Z scale_ub_tensor = None 2025-05-07T20:32:47.3311037Z 2025-05-07T20:32:47.3311286Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3311624Z op = silu_mul_quant 2025-05-07T20:32:47.3311938Z if compiled: 2025-05-07T20:32:47.3312206Z op = torch.compile(op) 2025-05-07T20:32:47.3312527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3312817Z 2025-05-07T20:32:47.3313029Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3313634Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3314143Z 2025-05-07T20:32:47.3314402Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3314838Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3315150Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3315488Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3315874Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3316208Z 2025-05-07T20:32:47.3316421Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.3316639Z 2025-05-07T20:32:47.3316746Z moe/activation_test.py:126: 2025-05-07T20:32:47.3317065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3317421Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3317772Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3318606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3319449Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3320026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3320831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3321561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3322317Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3323094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3323776Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3324414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3324959Z fn() 2025-05-07T20:32:47.3325498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3326111Z self.fn.run( 2025-05-07T20:32:47.3326598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3327160Z kernel = self.compile( 2025-05-07T20:32:47.3327738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3328434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3328884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3329157Z 2025-05-07T20:32:47.3329374Z self = 2025-05-07T20:32:47.3330585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3332031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ab34720>} 2025-05-07T20:32:47.3333423Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3334492Z context = 2025-05-07T20:32:47.3334860Z 2025-05-07T20:32:47.3335038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3335592Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3336081Z module_map=module_map) 2025-05-07T20:32:47.3336520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3336940Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3337231Z E ^ 2025-05-07T20:32:47.3337719Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3338196Z 2025-05-07T20:32:47.3338634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3339173Z 2025-05-07T20:32:47.3339293Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3339734Z self=, 2025-05-07T20:32:47.3340162Z T=16384, 2025-05-07T20:32:47.3340368Z D=7168, 2025-05-07T20:32:47.3340579Z scale_ub=1200.0, 2025-05-07T20:32:47.3340813Z contiguous=False, 2025-05-07T20:32:47.3341058Z compiled=False, 2025-05-07T20:32:47.3341283Z ) 2025-05-07T20:32:47.3341618Z self = 2025-05-07T20:32:47.3342158Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3342460Z 2025-05-07T20:32:47.3342551Z @given( 2025-05-07T20:32:47.3342798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3343136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3343467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3343816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3344169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3344479Z ) 2025-05-07T20:32:47.3344853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3345317Z def test_silu_mul_quant( 2025-05-07T20:32:47.3345581Z self, 2025-05-07T20:32:47.3345795Z T: int, 2025-05-07T20:32:47.3345999Z D: int, 2025-05-07T20:32:47.3346240Z scale_ub: Optional[float], 2025-05-07T20:32:47.3346537Z contiguous: bool, 2025-05-07T20:32:47.3346793Z compiled: bool, 2025-05-07T20:32:47.3347032Z ) -> None: 2025-05-07T20:32:47.3347262Z torch.manual_seed(2025) 2025-05-07T20:32:47.3347512Z 2025-05-07T20:32:47.3347808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3348169Z 2025-05-07T20:32:47.3348371Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3348683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3349012Z x = x_sign * x_clamp 2025-05-07T20:32:47.3349268Z x0 = x[:, :D] 2025-05-07T20:32:47.3349498Z x1 = x[:, D:] 2025-05-07T20:32:47.3349723Z 2025-05-07T20:32:47.3349922Z if contiguous: 2025-05-07T20:32:47.3350163Z x0 = x0.contiguous() 2025-05-07T20:32:47.3350440Z x1 = x1.contiguous() 2025-05-07T20:32:47.3350700Z 2025-05-07T20:32:47.3350902Z if scale_ub is not None: 2025-05-07T20:32:47.3351245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3351608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3351933Z ) 2025-05-07T20:32:47.3352142Z else: 2025-05-07T20:32:47.3352366Z scale_ub_tensor = None 2025-05-07T20:32:47.3352628Z 2025-05-07T20:32:47.3352876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3353211Z op = silu_mul_quant 2025-05-07T20:32:47.3353473Z if compiled: 2025-05-07T20:32:47.3353741Z op = torch.compile(op) 2025-05-07T20:32:47.3354059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3354391Z 2025-05-07T20:32:47.3354600Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3354779Z 2025-05-07T20:32:47.3354894Z moe/activation_test.py:117: 2025-05-07T20:32:47.3362587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3363035Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3363347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3364121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.3364849Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3365420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3366136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3366842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3367406Z kernel = self.compile( 2025-05-07T20:32:47.3367982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3368667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3369095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3369342Z 2025-05-07T20:32:47.3369567Z self = 2025-05-07T20:32:47.3370689Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3372123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0899a13880>} 2025-05-07T20:32:47.3373520Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3374586Z context = 2025-05-07T20:32:47.3374894Z 2025-05-07T20:32:47.3375083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3375631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3376134Z module_map=module_map) 2025-05-07T20:32:47.3376524Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3376895Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3377175Z E ^ 2025-05-07T20:32:47.3377670Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.3378583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
[condensed: the next nine Hypothesis examples re-printed the identical test body and traceback shown above, differing only in the drawn parameters, the failing line, and which Triton kernel failed to compile; each raised the identical triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:47.3379233Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row
2025-05-07T20:32:47.3404350Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3418417Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3431950Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
2025-05-07T20:32:47.3448834Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3462330Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3475670Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
2025-05-07T20:32:47.3492442Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
2025-05-07T20:32:47.3513093Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
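Why every compile fails: Triton lowers torch.float8_e4m3fn to its fp8e4nv type, and per the error text this runner's GPU architecture only exposes fp8e4b15 and fp8e5; fp8e4nv generally requires an NVIDIA part of compute capability 8.9 or newer (Ada/Hopper). A minimal guard sketch that would skip these tests on older hardware; the helper name and the 8.9 threshold are assumptions, not code from this repository or this log:

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9;
    # older GPUs only expose fp8e4b15 / fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A test class could then be decorated with unittest.skipIf(not supports_fp8e4nv(), "requires SM 8.9+"), so Hypothesis never reaches the Triton compile on unsupported runners.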
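For reference, the contract the test relies on -- triton_quantize_fp8_row(y, scale_ub_tensor) returns (y_fp8, y_scale) such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None] -- can be sketched in plain PyTorch. Row-wise absmax scaling and the use of scale_ub as a cap on the per-row max are assumptions about fbgemm's kernel, not taken from it:

from typing import Optional, Tuple

import torch

def quantize_fp8_row_sketch(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical row-wise quantization into float8_e4m3fn.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = x.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap each row at scale_ub
    scale = row_max.clamp(min=1e-12) / fp8_max      # per-row dequantization scale
    x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale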
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3530443Z 2025-05-07T20:32:47.3530876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3530881Z 2025-05-07T20:32:47.3530990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3531228Z self=, 2025-05-07T20:32:47.3531474Z T=4096, 2025-05-07T20:32:47.3531554Z D=5120, 2025-05-07T20:32:47.3531644Z scale_ub=None, 2025-05-07T20:32:47.3531732Z contiguous=True, 2025-05-07T20:32:47.3531817Z compiled=True, 2025-05-07T20:32:47.3531898Z ) 2025-05-07T20:32:47.3532129Z self = 2025-05-07T20:32:47.3532354Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.3532364Z 2025-05-07T20:32:47.3532483Z @given( 2025-05-07T20:32:47.3532612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3532721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3532843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3532965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3533088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3533166Z ) 2025-05-07T20:32:47.3533422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3533530Z def test_silu_mul_quant( 2025-05-07T20:32:47.3533610Z self, 2025-05-07T20:32:47.3533690Z T: int, 2025-05-07T20:32:47.3533773Z D: int, 2025-05-07T20:32:47.3533875Z scale_ub: Optional[float], 2025-05-07T20:32:47.3533973Z contiguous: bool, 2025-05-07T20:32:47.3534066Z compiled: bool, 2025-05-07T20:32:47.3534149Z ) -> None: 2025-05-07T20:32:47.3534253Z torch.manual_seed(2025) 2025-05-07T20:32:47.3534328Z 2025-05-07T20:32:47.3534505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3534584Z 2025-05-07T20:32:47.3534680Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3534815Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3534912Z x = x_sign * x_clamp 2025-05-07T20:32:47.3534996Z x0 = x[:, :D] 2025-05-07T20:32:47.3535080Z x1 = x[:, D:] 2025-05-07T20:32:47.3535160Z 2025-05-07T20:32:47.3535252Z if contiguous: 2025-05-07T20:32:47.3535351Z x0 = x0.contiguous() 2025-05-07T20:32:47.3535445Z x1 = x1.contiguous() 2025-05-07T20:32:47.3535521Z 2025-05-07T20:32:47.3535621Z if scale_ub is not None: 2025-05-07T20:32:47.3535732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3535875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3535960Z ) 2025-05-07T20:32:47.3536042Z else: 2025-05-07T20:32:47.3536141Z scale_ub_tensor = None 2025-05-07T20:32:47.3536220Z 2025-05-07T20:32:47.3536354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3536449Z op = silu_mul_quant 2025-05-07T20:32:47.3536542Z if compiled: 2025-05-07T20:32:47.3536646Z op = torch.compile(op) 2025-05-07T20:32:47.3536758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3536835Z 2025-05-07T20:32:47.3536934Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3537063Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3537138Z 2025-05-07T20:32:47.3537280Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3537388Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3537494Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3537673Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3537828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3537905Z 2025-05-07T20:32:47.3538010Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.3538020Z 2025-05-07T20:32:47.3538123Z moe/activation_test.py:126: 2025-05-07T20:32:47.3538261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3538372Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3538512Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3539134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3539245Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3539619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3539937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3540320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3540594Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3540991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3541168Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3541526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3541612Z fn() 2025-05-07T20:32:47.3542028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3542116Z self.fn.run( 2025-05-07T20:32:47.3542474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3542574Z kernel = self.compile( 2025-05-07T20:32:47.3542974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3543159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3543292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3543300Z 2025-05-07T20:32:47.3543511Z self = 2025-05-07T20:32:47.3544316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3544847Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08986d2200>} 2025-05-07T20:32:47.3545624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3545827Z context = 2025-05-07T20:32:47.3545832Z 2025-05-07T20:32:47.3546005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3546283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3546406Z module_map=module_map) 2025-05-07T20:32:47.3546581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3546691Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3546771Z E ^ 2025-05-07T20:32:47.3547187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3547194Z 2025-05-07T20:32:47.3547632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3547637Z 2025-05-07T20:32:47.3547746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3547980Z self=, 2025-05-07T20:32:47.3548068Z T=16384, 2025-05-07T20:32:47.3548148Z D=5120, 2025-05-07T20:32:47.3548237Z scale_ub=None, 2025-05-07T20:32:47.3548325Z contiguous=True, 2025-05-07T20:32:47.3548454Z compiled=True, 2025-05-07T20:32:47.3548533Z ) 2025-05-07T20:32:47.3548762Z self = 2025-05-07T20:32:47.3548947Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.3548952Z 2025-05-07T20:32:47.3549076Z @given( 2025-05-07T20:32:47.3549204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3549346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3549471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3549593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3549717Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3549794Z ) 2025-05-07T20:32:47.3550051Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3550151Z def test_silu_mul_quant( 2025-05-07T20:32:47.3550230Z self, 2025-05-07T20:32:47.3550314Z T: int, 2025-05-07T20:32:47.3550399Z D: int, 2025-05-07T20:32:47.3550501Z scale_ub: Optional[float], 2025-05-07T20:32:47.3550593Z contiguous: bool, 2025-05-07T20:32:47.3550684Z compiled: bool, 2025-05-07T20:32:47.3550766Z ) -> None: 2025-05-07T20:32:47.3550864Z torch.manual_seed(2025) 2025-05-07T20:32:47.3550949Z 2025-05-07T20:32:47.3551130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3551213Z 2025-05-07T20:32:47.3551307Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3551435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3551531Z x = x_sign * x_clamp 2025-05-07T20:32:47.3551612Z x0 = x[:, :D] 2025-05-07T20:32:47.3551695Z x1 = x[:, D:] 2025-05-07T20:32:47.3551772Z 2025-05-07T20:32:47.3551858Z if contiguous: 2025-05-07T20:32:47.3551954Z x0 = x0.contiguous() 2025-05-07T20:32:47.3552051Z x1 = x1.contiguous() 2025-05-07T20:32:47.3552130Z 2025-05-07T20:32:47.3552224Z if scale_ub is not None: 2025-05-07T20:32:47.3552339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3552481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3552562Z ) 2025-05-07T20:32:47.3552644Z else: 2025-05-07T20:32:47.3552743Z scale_ub_tensor = None 2025-05-07T20:32:47.3552824Z 2025-05-07T20:32:47.3552959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3553054Z op = silu_mul_quant 2025-05-07T20:32:47.3553143Z if compiled: 2025-05-07T20:32:47.3553246Z op = torch.compile(op) 2025-05-07T20:32:47.3553356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3553435Z 2025-05-07T20:32:47.3553531Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3553657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3553734Z 2025-05-07T20:32:47.3553880Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3553988Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3554092Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3554225Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3554373Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3554495Z 2025-05-07T20:32:47.3554606Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.3554611Z 2025-05-07T20:32:47.3554714Z moe/activation_test.py:126: 2025-05-07T20:32:47.3554847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3554958Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3555096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3555672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3555823Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3556197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3556435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3556859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3557165Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3557563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3557738Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3558098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3558178Z fn() 2025-05-07T20:32:47.3558599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3558688Z self.fn.run( 2025-05-07T20:32:47.3559041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3559140Z kernel = self.compile( 2025-05-07T20:32:47.3559542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3559726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3559862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3559866Z 2025-05-07T20:32:47.3560162Z self = 2025-05-07T20:32:47.3560963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3561497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb34900>} 2025-05-07T20:32:47.3562276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3562479Z context = 2025-05-07T20:32:47.3562483Z 2025-05-07T20:32:47.3562656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3562934Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3563046Z module_map=module_map) 2025-05-07T20:32:47.3563217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3563328Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3563407Z E ^ 2025-05-07T20:32:47.3563776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3563783Z 2025-05-07T20:32:47.3564289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3564294Z 2025-05-07T20:32:47.3564404Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3564643Z self=, 2025-05-07T20:32:47.3564723Z T=1, 2025-05-07T20:32:47.3564803Z D=5120, 2025-05-07T20:32:47.3564894Z scale_ub=1200.0, 2025-05-07T20:32:47.3564981Z contiguous=True, 2025-05-07T20:32:47.3565067Z compiled=True, 2025-05-07T20:32:47.3565147Z ) 2025-05-07T20:32:47.3565375Z self = 2025-05-07T20:32:47.3565589Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3565597Z 2025-05-07T20:32:47.3565676Z @given( 2025-05-07T20:32:47.3565801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3565906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3566068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3566228Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3566351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3566429Z ) 2025-05-07T20:32:47.3566686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3566788Z def test_silu_mul_quant( 2025-05-07T20:32:47.3566867Z self, 2025-05-07T20:32:47.3566947Z T: int, 2025-05-07T20:32:47.3567029Z D: int, 2025-05-07T20:32:47.3567172Z scale_ub: Optional[float], 2025-05-07T20:32:47.3567306Z contiguous: bool, 2025-05-07T20:32:47.3567432Z compiled: bool, 2025-05-07T20:32:47.3567547Z ) -> None: 2025-05-07T20:32:47.3567671Z torch.manual_seed(2025) 2025-05-07T20:32:47.3567745Z 2025-05-07T20:32:47.3567923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3568003Z 2025-05-07T20:32:47.3568098Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3568233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3568323Z x = x_sign * x_clamp 2025-05-07T20:32:47.3568406Z x0 = x[:, :D] 2025-05-07T20:32:47.3568493Z x1 = x[:, D:] 2025-05-07T20:32:47.3568569Z 2025-05-07T20:32:47.3568656Z if contiguous: 2025-05-07T20:32:47.3568752Z x0 = x0.contiguous() 2025-05-07T20:32:47.3568843Z x1 = x1.contiguous() 2025-05-07T20:32:47.3568921Z 2025-05-07T20:32:47.3569015Z if scale_ub is not None: 2025-05-07T20:32:47.3569124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3569268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3569346Z ) 2025-05-07T20:32:47.3569422Z else: 2025-05-07T20:32:47.3569521Z scale_ub_tensor = None 2025-05-07T20:32:47.3569595Z 2025-05-07T20:32:47.3569728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3569830Z op = silu_mul_quant 2025-05-07T20:32:47.3569920Z if compiled: 2025-05-07T20:32:47.3570022Z op = torch.compile(op) 2025-05-07T20:32:47.3570135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3570209Z 2025-05-07T20:32:47.3570304Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3570308Z 2025-05-07T20:32:47.3570407Z moe/activation_test.py:117: 2025-05-07T20:32:47.3570540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3570649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3570754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3571136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3571239Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3571749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3571911Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3572286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3572518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3572873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3572968Z kernel = self.compile( 2025-05-07T20:32:47.3573362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3573589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3573720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3573725Z 2025-05-07T20:32:47.3573937Z self = 2025-05-07T20:32:47.3574819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3575348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73cd60>} 2025-05-07T20:32:47.3576117Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3576316Z context = 2025-05-07T20:32:47.3576320Z 2025-05-07T20:32:47.3576494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3576773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3576891Z module_map=module_map) 2025-05-07T20:32:47.3577059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3577161Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3577243Z E ^ 2025-05-07T20:32:47.3577610Z E ValueError("type fp8e4nv not supported in this architecture. 
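For orientation, ref_fn in the test is SiLU-and-multiply in fp32 followed by row-wise FP8 quantization via triton_quantize_fp8_row. A rough pure-PyTorch sketch of that computation, assuming row-wise dynamic scaling against the E4M3 finite max of 448 and clamping of the row max to scale_ub when one is given; those details are assumptions for illustration, not fbgemm_gpu's exact kernel logic:

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # assumed finite max of torch.float8_e4m3fn


def rowwise_quantize_fp8_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # One dequantization scale per row, chosen so the row's max |value|
    # lands at the edge of the fp8 range.
    row_max = y.abs().amax(dim=1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = row_max / E4M3_MAX
    y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)


def silu_mul_quant_ref_sketch(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Mirrors ref_fn in the test: y = SiLU(x0) * x1 in fp32, then quantize per row.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
    return rowwise_quantize_fp8_sketch(y, scale_ub)

Dequantization then matches the test's check: y_fp8.to(torch.float32) * y_scale[:, None] recovers y up to fp8 rounding.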
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3577615Z 2025-05-07T20:32:47.3578067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3578084Z 2025-05-07T20:32:47.3578235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3578546Z self=, 2025-05-07T20:32:47.3578659Z T=1, 2025-05-07T20:32:47.3578739Z D=5120, 2025-05-07T20:32:47.3578823Z scale_ub=None, 2025-05-07T20:32:47.3578921Z contiguous=False, 2025-05-07T20:32:47.3579011Z compiled=True, 2025-05-07T20:32:47.3579087Z ) 2025-05-07T20:32:47.3579323Z self = 2025-05-07T20:32:47.3579492Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3579497Z 2025-05-07T20:32:47.3579583Z @given( 2025-05-07T20:32:47.3579707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3579809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3579931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3580052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3580172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3580252Z ) 2025-05-07T20:32:47.3580507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3580606Z def test_silu_mul_quant( 2025-05-07T20:32:47.3580691Z self, 2025-05-07T20:32:47.3580770Z T: int, 2025-05-07T20:32:47.3580907Z D: int, 2025-05-07T20:32:47.3581018Z scale_ub: Optional[float], 2025-05-07T20:32:47.3581113Z contiguous: bool, 2025-05-07T20:32:47.3581203Z compiled: bool, 2025-05-07T20:32:47.3581284Z ) -> None: 2025-05-07T20:32:47.3581382Z torch.manual_seed(2025) 2025-05-07T20:32:47.3581459Z 2025-05-07T20:32:47.3581635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3581712Z 2025-05-07T20:32:47.3581808Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3581936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3582069Z x = x_sign * x_clamp 2025-05-07T20:32:47.3582157Z x0 = x[:, :D] 2025-05-07T20:32:47.3582240Z x1 = x[:, D:] 2025-05-07T20:32:47.3582314Z 2025-05-07T20:32:47.3582402Z if contiguous: 2025-05-07T20:32:47.3582495Z x0 = x0.contiguous() 2025-05-07T20:32:47.3582625Z x1 = x1.contiguous() 2025-05-07T20:32:47.3582702Z 2025-05-07T20:32:47.3582837Z if scale_ub is not None: 2025-05-07T20:32:47.3582953Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3583093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3583172Z ) 2025-05-07T20:32:47.3583253Z else: 2025-05-07T20:32:47.3583350Z scale_ub_tensor = None 2025-05-07T20:32:47.3583426Z 2025-05-07T20:32:47.3583562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3583656Z op = silu_mul_quant 2025-05-07T20:32:47.3583743Z if compiled: 2025-05-07T20:32:47.3583852Z op = torch.compile(op) 2025-05-07T20:32:47.3583960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3584034Z 2025-05-07T20:32:47.3584132Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3584257Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3584338Z 2025-05-07T20:32:47.3584479Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3584586Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3584693Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3584818Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3584962Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3585040Z 2025-05-07T20:32:47.3585143Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.3585148Z 2025-05-07T20:32:47.3585249Z moe/activation_test.py:126: 2025-05-07T20:32:47.3585385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3585495Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3585640Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3586216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3586327Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3586707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3586942Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3587322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3587588Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3587977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3588156Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3588510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3588591Z fn() 2025-05-07T20:32:47.3589059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3589145Z self.fn.run( 2025-05-07T20:32:47.3589499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3589596Z kernel = self.compile( 2025-05-07T20:32:47.3589988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3590176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3590351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3590355Z 2025-05-07T20:32:47.3590571Z self = 2025-05-07T20:32:47.3591410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3591993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad732de0>} 2025-05-07T20:32:47.3592763Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3592966Z context = 2025-05-07T20:32:47.3592972Z 2025-05-07T20:32:47.3593148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3593423Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3593534Z module_map=module_map) 2025-05-07T20:32:47.3593710Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3593821Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3593901Z E ^ 2025-05-07T20:32:47.3594272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3594276Z 2025-05-07T20:32:47.3594706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3594710Z 2025-05-07T20:32:47.3594822Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3595053Z self=, 2025-05-07T20:32:47.3595135Z T=1, 2025-05-07T20:32:47.3595220Z D=5120, 2025-05-07T20:32:47.3595304Z scale_ub=None, 2025-05-07T20:32:47.3595396Z contiguous=True, 2025-05-07T20:32:47.3595482Z compiled=False, 2025-05-07T20:32:47.3595557Z ) 2025-05-07T20:32:47.3595789Z self = 2025-05-07T20:32:47.3595964Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.3595969Z 2025-05-07T20:32:47.3596046Z @given( 2025-05-07T20:32:47.3596175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3596280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3596398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3596525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3596642Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3596721Z ) 2025-05-07T20:32:47.3596976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3597071Z def test_silu_mul_quant( 2025-05-07T20:32:47.3597152Z self, 2025-05-07T20:32:47.3597232Z T: int, 2025-05-07T20:32:47.3597309Z D: int, 2025-05-07T20:32:47.3597416Z scale_ub: Optional[float], 2025-05-07T20:32:47.3597512Z contiguous: bool, 2025-05-07T20:32:47.3597656Z compiled: bool, 2025-05-07T20:32:47.3597743Z ) -> None: 2025-05-07T20:32:47.3597841Z torch.manual_seed(2025) 2025-05-07T20:32:47.3597915Z 2025-05-07T20:32:47.3598097Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3598174Z 2025-05-07T20:32:47.3598268Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3598399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3598489Z x = x_sign * x_clamp 2025-05-07T20:32:47.3598574Z x0 = x[:, :D] 2025-05-07T20:32:47.3598699Z x1 = x[:, D:] 2025-05-07T20:32:47.3598775Z 2025-05-07T20:32:47.3598864Z if contiguous: 2025-05-07T20:32:47.3598958Z x0 = x0.contiguous() 2025-05-07T20:32:47.3599051Z x1 = x1.contiguous() 2025-05-07T20:32:47.3599128Z 2025-05-07T20:32:47.3599222Z if scale_ub is not None: 2025-05-07T20:32:47.3599372Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3599556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3599634Z ) 2025-05-07T20:32:47.3599712Z else: 2025-05-07T20:32:47.3599813Z scale_ub_tensor = None 2025-05-07T20:32:47.3599887Z 2025-05-07T20:32:47.3600025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3600228Z op = silu_mul_quant 2025-05-07T20:32:47.3600319Z if compiled: 2025-05-07T20:32:47.3600424Z op = torch.compile(op) 2025-05-07T20:32:47.3600534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3600611Z 2025-05-07T20:32:47.3600709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3600713Z 2025-05-07T20:32:47.3600815Z moe/activation_test.py:117: 2025-05-07T20:32:47.3600947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3601053Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3601160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3601684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3601785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3602156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3602392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3602745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3602843Z kernel = self.compile( 2025-05-07T20:32:47.3603243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3603426Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3603562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3603568Z 2025-05-07T20:32:47.3603783Z self = 2025-05-07T20:32:47.3604581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3605103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898019b20>} 2025-05-07T20:32:47.3605869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3606072Z context = 2025-05-07T20:32:47.3606080Z 2025-05-07T20:32:47.3606300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3606583Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3606696Z module_map=module_map) 2025-05-07T20:32:47.3606863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3606969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3607048Z E ^ 2025-05-07T20:32:47.3607413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3607457Z 2025-05-07T20:32:47.3607890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3607895Z 2025-05-07T20:32:47.3608002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3608235Z self=, 2025-05-07T20:32:47.3608356Z T=128, 2025-05-07T20:32:47.3608437Z D=5120, 2025-05-07T20:32:47.3608565Z scale_ub=None, 2025-05-07T20:32:47.3608656Z contiguous=False, 2025-05-07T20:32:47.3608742Z compiled=True, 2025-05-07T20:32:47.3608822Z ) 2025-05-07T20:32:47.3609049Z self = 2025-05-07T20:32:47.3609225Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3609235Z 2025-05-07T20:32:47.3609315Z @given( 2025-05-07T20:32:47.3609439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3609550Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3609669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3609791Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3609912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3609988Z ) 2025-05-07T20:32:47.3610249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3610355Z def test_silu_mul_quant( 2025-05-07T20:32:47.3610433Z self, 2025-05-07T20:32:47.3610511Z T: int, 2025-05-07T20:32:47.3610592Z D: int, 2025-05-07T20:32:47.3610694Z scale_ub: Optional[float], 2025-05-07T20:32:47.3610789Z contiguous: bool, 2025-05-07T20:32:47.3610877Z compiled: bool, 2025-05-07T20:32:47.3610958Z ) -> None: 2025-05-07T20:32:47.3611058Z torch.manual_seed(2025) 2025-05-07T20:32:47.3611136Z 2025-05-07T20:32:47.3611312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3611392Z 2025-05-07T20:32:47.3611486Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3611615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3611708Z x = x_sign * x_clamp 2025-05-07T20:32:47.3611790Z x0 = x[:, :D] 2025-05-07T20:32:47.3611875Z x1 = x[:, D:] 2025-05-07T20:32:47.3611952Z 2025-05-07T20:32:47.3612042Z if contiguous: 2025-05-07T20:32:47.3612141Z x0 = x0.contiguous() 2025-05-07T20:32:47.3612233Z x1 = x1.contiguous() 2025-05-07T20:32:47.3612307Z 2025-05-07T20:32:47.3612404Z if scale_ub is not None: 2025-05-07T20:32:47.3612513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3612652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3612735Z ) 2025-05-07T20:32:47.3612813Z else: 2025-05-07T20:32:47.3612909Z scale_ub_tensor = None 2025-05-07T20:32:47.3612986Z 2025-05-07T20:32:47.3613122Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3613215Z op = silu_mul_quant 2025-05-07T20:32:47.3613305Z if compiled: 2025-05-07T20:32:47.3613767Z op = torch.compile(op) 2025-05-07T20:32:47.3613885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3613962Z 2025-05-07T20:32:47.3614056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3614152Z 2025-05-07T20:32:47.3614262Z moe/activation_test.py:117: 2025-05-07T20:32:47.3614395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3614499Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3614606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3614989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3615086Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3615600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3615759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3616133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3616422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3616832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3616934Z kernel = self.compile( 2025-05-07T20:32:47.3617328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3617513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3617648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3617653Z 2025-05-07T20:32:47.3617871Z self = 2025-05-07T20:32:47.3618677Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3619208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad733a60>} 2025-05-07T20:32:47.3619980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3620179Z context = 2025-05-07T20:32:47.3620183Z 2025-05-07T20:32:47.3620357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3620637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3620748Z module_map=module_map) 2025-05-07T20:32:47.3620920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3621023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3621104Z E ^ 2025-05-07T20:32:47.3621481Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3621485Z 2025-05-07T20:32:47.3621913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3621918Z 2025-05-07T20:32:47.3622028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3622262Z self=, 2025-05-07T20:32:47.3622341Z T=128, 2025-05-07T20:32:47.3622425Z D=7168, 2025-05-07T20:32:47.3622513Z scale_ub=1200.0, 2025-05-07T20:32:47.3622603Z contiguous=False, 2025-05-07T20:32:47.3622692Z compiled=False, 2025-05-07T20:32:47.3622767Z ) 2025-05-07T20:32:47.3622991Z self = 2025-05-07T20:32:47.3623177Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3623184Z 2025-05-07T20:32:47.3623309Z @given( 2025-05-07T20:32:47.3623444Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3623547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3623666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3623790Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3623906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3623983Z ) 2025-05-07T20:32:47.3624240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3624379Z def test_silu_mul_quant( 2025-05-07T20:32:47.3624457Z self, 2025-05-07T20:32:47.3624538Z T: int, 2025-05-07T20:32:47.3624615Z D: int, 2025-05-07T20:32:47.3624716Z scale_ub: Optional[float], 2025-05-07T20:32:47.3624813Z contiguous: bool, 2025-05-07T20:32:47.3624903Z compiled: bool, 2025-05-07T20:32:47.3625053Z ) -> None: 2025-05-07T20:32:47.3625155Z torch.manual_seed(2025) 2025-05-07T20:32:47.3625231Z 2025-05-07T20:32:47.3625447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3625527Z 2025-05-07T20:32:47.3625622Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3625753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3625843Z x = x_sign * x_clamp 2025-05-07T20:32:47.3625924Z x0 = x[:, :D] 2025-05-07T20:32:47.3626010Z x1 = x[:, D:] 2025-05-07T20:32:47.3626087Z 2025-05-07T20:32:47.3626172Z if contiguous: 2025-05-07T20:32:47.3626273Z x0 = x0.contiguous() 2025-05-07T20:32:47.3626366Z x1 = x1.contiguous() 2025-05-07T20:32:47.3626442Z 2025-05-07T20:32:47.3626535Z if scale_ub is not None: 2025-05-07T20:32:47.3626643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3626785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3626865Z ) 2025-05-07T20:32:47.3626946Z else: 2025-05-07T20:32:47.3627049Z scale_ub_tensor = None 2025-05-07T20:32:47.3627124Z 2025-05-07T20:32:47.3627257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3627353Z op = silu_mul_quant 2025-05-07T20:32:47.3627441Z if compiled: 2025-05-07T20:32:47.3627544Z op = torch.compile(op) 2025-05-07T20:32:47.3627655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3627731Z 2025-05-07T20:32:47.3627823Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3627837Z 2025-05-07T20:32:47.3627940Z moe/activation_test.py:117: 2025-05-07T20:32:47.3628072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3631778Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3631908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3632448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3632558Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3632936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3633178Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3633536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3633641Z kernel = self.compile( 2025-05-07T20:32:47.3634042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3634231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3634376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3634381Z 2025-05-07T20:32:47.3634599Z self = 2025-05-07T20:32:47.3635477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3636009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb3a2a0>} 2025-05-07T20:32:47.3636784Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3637028Z context = 2025-05-07T20:32:47.3637032Z 2025-05-07T20:32:47.3637208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3637535Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3637684Z module_map=module_map) 2025-05-07T20:32:47.3637857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3637971Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3638053Z E ^ 2025-05-07T20:32:47.3638427Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3638432Z 2025-05-07T20:32:47.3638870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3638877Z 2025-05-07T20:32:47.3638987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3639227Z self=, 2025-05-07T20:32:47.3639308Z T=128, 2025-05-07T20:32:47.3639393Z D=5120, 2025-05-07T20:32:47.3639483Z scale_ub=None, 2025-05-07T20:32:47.3639579Z contiguous=False, 2025-05-07T20:32:47.3639676Z compiled=False, 2025-05-07T20:32:47.3639757Z ) 2025-05-07T20:32:47.3639989Z self = 2025-05-07T20:32:47.3640268Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.3640274Z 2025-05-07T20:32:47.3640355Z @given( 2025-05-07T20:32:47.3640483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3640594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3640717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3640847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3640967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3641045Z ) 2025-05-07T20:32:47.3641307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3641410Z def test_silu_mul_quant( 2025-05-07T20:32:47.3641490Z self, 2025-05-07T20:32:47.3641575Z T: int, 2025-05-07T20:32:47.3641657Z D: int, 2025-05-07T20:32:47.3641762Z scale_ub: Optional[float], 2025-05-07T20:32:47.3641858Z contiguous: bool, 2025-05-07T20:32:47.3641950Z compiled: bool, 2025-05-07T20:32:47.3642034Z ) -> None: 2025-05-07T20:32:47.3642136Z torch.manual_seed(2025) 2025-05-07T20:32:47.3642212Z 2025-05-07T20:32:47.3642393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3642475Z 2025-05-07T20:32:47.3642571Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3642707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3642800Z x = x_sign * x_clamp 2025-05-07T20:32:47.3642885Z x0 = x[:, :D] 2025-05-07T20:32:47.3642971Z x1 = x[:, D:] 2025-05-07T20:32:47.3643048Z 2025-05-07T20:32:47.3643136Z if contiguous: 2025-05-07T20:32:47.3643237Z x0 = x0.contiguous() 2025-05-07T20:32:47.3643379Z x1 = x1.contiguous() 2025-05-07T20:32:47.3643456Z 2025-05-07T20:32:47.3643558Z if scale_ub is not None: 2025-05-07T20:32:47.3643671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3643815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3643901Z ) 2025-05-07T20:32:47.3643981Z else: 2025-05-07T20:32:47.3644083Z scale_ub_tensor = None 2025-05-07T20:32:47.3644159Z 2025-05-07T20:32:47.3644296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3644396Z op = silu_mul_quant 2025-05-07T20:32:47.3644531Z if compiled: 2025-05-07T20:32:47.3644636Z op = torch.compile(op) 2025-05-07T20:32:47.3644751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3644827Z 2025-05-07T20:32:47.3644923Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3644927Z 2025-05-07T20:32:47.3645072Z moe/activation_test.py:117: 2025-05-07T20:32:47.3645250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3645362Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3645468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3645991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3646100Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3646474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3646712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3647070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3647170Z kernel = self.compile( 2025-05-07T20:32:47.3647575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3647772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3647906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3647910Z 2025-05-07T20:32:47.3648132Z self = 2025-05-07T20:32:47.3648937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3649474Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73c720>} 2025-05-07T20:32:47.3650249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3650456Z context = 2025-05-07T20:32:47.3650463Z 2025-05-07T20:32:47.3650638Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3650914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3651031Z module_map=module_map) 2025-05-07T20:32:47.3651203Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3651307Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3651395Z E ^ 2025-05-07T20:32:47.3651764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3651769Z 2025-05-07T20:32:47.3652206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3652212Z 2025-05-07T20:32:47.3652368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3652606Z self=, 2025-05-07T20:32:47.3652691Z T=128, 2025-05-07T20:32:47.3652771Z D=5120, 2025-05-07T20:32:47.3652860Z scale_ub=1200.0, 2025-05-07T20:32:47.3652953Z contiguous=True, 2025-05-07T20:32:47.3653041Z compiled=False, 2025-05-07T20:32:47.3653118Z ) 2025-05-07T20:32:47.3653353Z self = 2025-05-07T20:32:47.3653534Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.3653581Z 2025-05-07T20:32:47.3653667Z @given( 2025-05-07T20:32:47.3653794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3653900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3654025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3654188Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3654345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3654427Z ) 2025-05-07T20:32:47.3654687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3654785Z def test_silu_mul_quant( 2025-05-07T20:32:47.3654868Z self, 2025-05-07T20:32:47.3654949Z T: int, 2025-05-07T20:32:47.3655032Z D: int, 2025-05-07T20:32:47.3655136Z scale_ub: Optional[float], 2025-05-07T20:32:47.3655230Z contiguous: bool, 2025-05-07T20:32:47.3655323Z compiled: bool, 2025-05-07T20:32:47.3655406Z ) -> None: 2025-05-07T20:32:47.3655505Z torch.manual_seed(2025) 2025-05-07T20:32:47.3655584Z 2025-05-07T20:32:47.3655762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3655839Z 2025-05-07T20:32:47.3655939Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3656071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3656164Z x = x_sign * x_clamp 2025-05-07T20:32:47.3656253Z x0 = x[:, :D] 2025-05-07T20:32:47.3656336Z x1 = x[:, D:] 2025-05-07T20:32:47.3656415Z 2025-05-07T20:32:47.3656502Z if contiguous: 2025-05-07T20:32:47.3656597Z x0 = x0.contiguous() 2025-05-07T20:32:47.3656692Z x1 = x1.contiguous() 2025-05-07T20:32:47.3656768Z 2025-05-07T20:32:47.3656863Z if scale_ub is not None: 2025-05-07T20:32:47.3656979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3657120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3657201Z ) 2025-05-07T20:32:47.3657284Z else: 2025-05-07T20:32:47.3657382Z scale_ub_tensor = None 2025-05-07T20:32:47.3657457Z 2025-05-07T20:32:47.3657595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3657688Z op = silu_mul_quant 2025-05-07T20:32:47.3657777Z if compiled: 2025-05-07T20:32:47.3657886Z op = torch.compile(op) 2025-05-07T20:32:47.3657999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3658078Z 2025-05-07T20:32:47.3658173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3658177Z 2025-05-07T20:32:47.3658277Z moe/activation_test.py:117: 2025-05-07T20:32:47.3658415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3658520Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3658624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3659150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3659256Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3659632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3659917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3660276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3660379Z kernel = self.compile( 2025-05-07T20:32:47.3660778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3660960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3661101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3661106Z 2025-05-07T20:32:47.3661388Z self = 2025-05-07T20:32:47.3662196Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3662796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f8c20>} 2025-05-07T20:32:47.3663575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3663776Z context = 2025-05-07T20:32:47.3663780Z 2025-05-07T20:32:47.3663954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3664235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3664347Z module_map=module_map) 2025-05-07T20:32:47.3664519Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3664627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3664706Z E ^ 2025-05-07T20:32:47.3665084Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3665089Z 2025-05-07T20:32:47.3665518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3665523Z 2025-05-07T20:32:47.3665630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3665867Z self=, 2025-05-07T20:32:47.3665947Z T=1, 2025-05-07T20:32:47.3666032Z D=7168, 2025-05-07T20:32:47.3666120Z scale_ub=1200.0, 2025-05-07T20:32:47.3666210Z contiguous=True, 2025-05-07T20:32:47.3666301Z compiled=True, 2025-05-07T20:32:47.3666378Z ) 2025-05-07T20:32:47.3666606Z self = 2025-05-07T20:32:47.3666782Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3666790Z 2025-05-07T20:32:47.3666875Z @given( 2025-05-07T20:32:47.3667003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3667110Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3667229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3667355Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3667473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3667549Z ) 2025-05-07T20:32:47.3667809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3667909Z def test_silu_mul_quant( 2025-05-07T20:32:47.3667988Z self, 2025-05-07T20:32:47.3668069Z T: int, 2025-05-07T20:32:47.3668148Z D: int, 2025-05-07T20:32:47.3668251Z scale_ub: Optional[float], 2025-05-07T20:32:47.3668350Z contiguous: bool, 2025-05-07T20:32:47.3668438Z compiled: bool, 2025-05-07T20:32:47.3668521Z ) -> None: 2025-05-07T20:32:47.3668720Z torch.manual_seed(2025) 2025-05-07T20:32:47.3668799Z 2025-05-07T20:32:47.3668989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3669066Z 2025-05-07T20:32:47.3669161Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3669296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3669389Z x = x_sign * x_clamp 2025-05-07T20:32:47.3669473Z x0 = x[:, :D] 2025-05-07T20:32:47.3669558Z x1 = x[:, D:] 2025-05-07T20:32:47.3669631Z 2025-05-07T20:32:47.3669716Z if contiguous: 2025-05-07T20:32:47.3669854Z x0 = x0.contiguous() 2025-05-07T20:32:47.3669946Z x1 = x1.contiguous() 2025-05-07T20:32:47.3670020Z 2025-05-07T20:32:47.3670119Z if scale_ub is not None: 2025-05-07T20:32:47.3670227Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3670371Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3670489Z ) 2025-05-07T20:32:47.3670569Z else: 2025-05-07T20:32:47.3670707Z scale_ub_tensor = None 2025-05-07T20:32:47.3670782Z 2025-05-07T20:32:47.3670916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3671011Z op = silu_mul_quant 2025-05-07T20:32:47.3671097Z if compiled: 2025-05-07T20:32:47.3671199Z op = torch.compile(op) 2025-05-07T20:32:47.3671310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3671384Z 2025-05-07T20:32:47.3671476Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3671481Z 2025-05-07T20:32:47.3671585Z moe/activation_test.py:117: 2025-05-07T20:32:47.3671718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3671825Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3671928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3672308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3672417Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3672928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3673032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3673405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3673637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3673994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3674093Z kernel = self.compile( 2025-05-07T20:32:47.3674489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3674675Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3674816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3674820Z 2025-05-07T20:32:47.3675037Z self = 2025-05-07T20:32:47.3675834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3676356Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f9ee0>} 2025-05-07T20:32:47.3677125Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3677327Z context = 2025-05-07T20:32:47.3677376Z 2025-05-07T20:32:47.3677553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3677826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3677935Z module_map=module_map) 2025-05-07T20:32:47.3678110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3678212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3678293Z E ^ 2025-05-07T20:32:47.3678659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.3679135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.3679249Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the previous example, elided: _fbgemm_silu_mul_quant fails to compile with the same CompilationError at compiler.py:100.]
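[Every failure above has the same root cause: the kernel asks Triton for the fp8e4nv dtype, NVIDIA's FP8 E4M3 format, which Triton only lowers natively on GPUs of compute capability 8.9 or newer (Ada/Hopper class); on older parts only the 'fp8e4b15' and 'fp8e5' formats exist, exactly as the ValueError reports. A minimal sketch of a guard that would skip such tests instead of failing at compile time; supports_fp8e4nv and requires_fp8 are illustrative names, not part of the FBGEMM test suite:]

import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Best-effort check for native FP8 E4M3 (fp8e4nv) support: sm_89+."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Decorating fp8-dependent tests with this skips them cleanly on pre-sm_89
# GPUs rather than letting Triton raise at kernel-compile time.
requires_fp8 = unittest.skipUnless(
    supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires compute capability >= 8.9"
)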
2025-05-07T20:32:47.3692828Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[identical test body elided down to fn(); this example got past fn() and failed in the reference path instead:]
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
[jit.py/compiler.py frames identical to the previous traceback, elided]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
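[For context, the reference path computes silu(x0) * x1 followed by row-wise FP8 quantization with an optional per-row scale upper bound. A hedged, pure-PyTorch sketch of that contract; the names mirror the test, and the scale_ub clamping detail is a plausible reading of triton_quantize_fp8_row's behavior, not its actual source:]

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1 in fp32, as in the test's ref_fn
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # per-row absolute max, optionally capped by scale_ub
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX                    # per-row dequantization scale
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)  # quantize
    return y_fp8, scale.squeeze(-1)

[Dequantizing with y_fp8.to(torch.float32) * scale[:, None] reproduces the comparison the test performs on y and y_ref.]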
2025-05-07T20:32:47.3709188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
[the next four examples fail identically to the first — same test body, same _fbgemm_silu_mul_quant CompilationError — so only their headers are kept:]
2025-05-07T20:32:47.3709302Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.3723406Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.3736417Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.3749961Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
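[Note that the compiled=True and compiled=False examples fail identically: both paths launch the same Triton kernel, so torch.compile is not a factor. The ValueError itself names the fp8 formats that do compile on this architecture; a hedged sketch of a capability-based fallback follows — illustrative only, FBGEMM does not necessarily do this, and E5M2's reduced mantissa may not be acceptable for every workload:]

import torch
import triton.language as tl


def pick_fp8_tl_dtype():
    """Choose an fp8 Triton dtype the current GPU can actually compile."""
    major, minor = torch.cuda.get_device_capability()
    # tl.float8e4nv (E4M3) needs sm_89+; tl.float8e5 (E5M2) is the listed
    # alternative on older architectures.
    return tl.float8e4nv if (major, minor) >= (8, 9) else tl.float8e5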
[five more examples fail the same way; headers only:]
2025-05-07T20:32:47.3767268Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:47.3780365Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.3793503Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:47.3807085Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:47.3821021Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.3827923Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3828295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3828528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3828909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3829018Z kernel = self.compile( 2025-05-07T20:32:47.3829442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3829625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3829759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3829763Z 2025-05-07T20:32:47.3829981Z self = 2025-05-07T20:32:47.3830790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3831372Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad2916c0>} 2025-05-07T20:32:47.3832143Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3832346Z context = 2025-05-07T20:32:47.3832351Z 2025-05-07T20:32:47.3832523Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3832798Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3832954Z module_map=module_map) 2025-05-07T20:32:47.3833123Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3833225Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3833307Z E ^ 2025-05-07T20:32:47.3833723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3833764Z 2025-05-07T20:32:47.3834205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3834210Z 2025-05-07T20:32:47.3834320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3834552Z self=, 2025-05-07T20:32:47.3834635Z T=4096, 2025-05-07T20:32:47.3834715Z D=5120, 2025-05-07T20:32:47.3834801Z scale_ub=1200.0, 2025-05-07T20:32:47.3834897Z contiguous=False, 2025-05-07T20:32:47.3834984Z compiled=True, 2025-05-07T20:32:47.3835062Z ) 2025-05-07T20:32:47.3835292Z self = 2025-05-07T20:32:47.3835475Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:47.3835482Z 2025-05-07T20:32:47.3835567Z @given( 2025-05-07T20:32:47.3835693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3835797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3835920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3836041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3836162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3836239Z ) 2025-05-07T20:32:47.3836496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3836596Z def test_silu_mul_quant( 2025-05-07T20:32:47.3836675Z self, 2025-05-07T20:32:47.3836758Z T: int, 2025-05-07T20:32:47.3836840Z D: int, 2025-05-07T20:32:47.3836941Z scale_ub: Optional[float], 2025-05-07T20:32:47.3837035Z contiguous: bool, 2025-05-07T20:32:47.3837126Z compiled: bool, 2025-05-07T20:32:47.3837205Z ) -> None: 2025-05-07T20:32:47.3837307Z torch.manual_seed(2025) 2025-05-07T20:32:47.3837385Z 2025-05-07T20:32:47.3837565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3837641Z 2025-05-07T20:32:47.3837737Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3837868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3837964Z x = x_sign * x_clamp 2025-05-07T20:32:47.3838045Z x0 = x[:, :D] 2025-05-07T20:32:47.3838129Z x1 = x[:, D:] 2025-05-07T20:32:47.3838206Z 2025-05-07T20:32:47.3838294Z if contiguous: 2025-05-07T20:32:47.3838388Z x0 = x0.contiguous() 2025-05-07T20:32:47.3838484Z x1 = x1.contiguous() 2025-05-07T20:32:47.3838559Z 2025-05-07T20:32:47.3838655Z if scale_ub is not None: 2025-05-07T20:32:47.3838766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3838904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3838981Z ) 2025-05-07T20:32:47.3839063Z else: 2025-05-07T20:32:47.3839205Z scale_ub_tensor = None 2025-05-07T20:32:47.3839285Z 2025-05-07T20:32:47.3839421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3839513Z op = silu_mul_quant 2025-05-07T20:32:47.3839602Z if compiled: 2025-05-07T20:32:47.3839704Z op = torch.compile(op) 2025-05-07T20:32:47.3839811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3839888Z 2025-05-07T20:32:47.3839979Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3839984Z 2025-05-07T20:32:47.3840153Z moe/activation_test.py:117: 2025-05-07T20:32:47.3840334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3840437Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3840543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3840922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3841072Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3841623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3841726Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3842096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3842330Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3842682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3842783Z kernel = self.compile( 2025-05-07T20:32:47.3843180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3843361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3843496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3843503Z 2025-05-07T20:32:47.3843716Z self = 2025-05-07T20:32:47.3844518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3845040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad292fc0>} 2025-05-07T20:32:47.3845809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3846009Z context = 2025-05-07T20:32:47.3846017Z 2025-05-07T20:32:47.3846190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3846470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3846581Z module_map=module_map) 2025-05-07T20:32:47.3846746Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3846851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3846930Z E ^ 2025-05-07T20:32:47.3847295Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3847306Z 2025-05-07T20:32:47.3847734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3847738Z 2025-05-07T20:32:47.3847844Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3848077Z self=, 2025-05-07T20:32:47.3848160Z T=2048, 2025-05-07T20:32:47.3848282Z D=7168, 2025-05-07T20:32:47.3848375Z scale_ub=1200.0, 2025-05-07T20:32:47.3848466Z contiguous=False, 2025-05-07T20:32:47.3848554Z compiled=False, 2025-05-07T20:32:47.3848632Z ) 2025-05-07T20:32:47.3848859Z self = 2025-05-07T20:32:47.3849045Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3849050Z 2025-05-07T20:32:47.3849128Z @given( 2025-05-07T20:32:47.3849252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3849400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3849521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3849642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3849762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3849878Z ) 2025-05-07T20:32:47.3850137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3850295Z def test_silu_mul_quant( 2025-05-07T20:32:47.3850373Z self, 2025-05-07T20:32:47.3850454Z T: int, 2025-05-07T20:32:47.3850531Z D: int, 2025-05-07T20:32:47.3850631Z scale_ub: Optional[float], 2025-05-07T20:32:47.3850726Z contiguous: bool, 2025-05-07T20:32:47.3850815Z compiled: bool, 2025-05-07T20:32:47.3850893Z ) -> None: 2025-05-07T20:32:47.3850994Z torch.manual_seed(2025) 2025-05-07T20:32:47.3851069Z 2025-05-07T20:32:47.3851244Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3851328Z 2025-05-07T20:32:47.3851422Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3851550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3851644Z x = x_sign * x_clamp 2025-05-07T20:32:47.3851725Z x0 = x[:, :D] 2025-05-07T20:32:47.3851811Z x1 = x[:, D:] 2025-05-07T20:32:47.3851885Z 2025-05-07T20:32:47.3851972Z if contiguous: 2025-05-07T20:32:47.3852071Z x0 = x0.contiguous() 2025-05-07T20:32:47.3852163Z x1 = x1.contiguous() 2025-05-07T20:32:47.3852238Z 2025-05-07T20:32:47.3852335Z if scale_ub is not None: 2025-05-07T20:32:47.3852444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3852584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3852664Z ) 2025-05-07T20:32:47.3852740Z else: 2025-05-07T20:32:47.3852836Z scale_ub_tensor = None 2025-05-07T20:32:47.3852915Z 2025-05-07T20:32:47.3853048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3853143Z op = silu_mul_quant 2025-05-07T20:32:47.3853229Z if compiled: 2025-05-07T20:32:47.3853330Z op = torch.compile(op) 2025-05-07T20:32:47.3853441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3853519Z 2025-05-07T20:32:47.3853613Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3853617Z 2025-05-07T20:32:47.3853723Z moe/activation_test.py:117: 2025-05-07T20:32:47.3853856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3853958Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3854065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3854578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.3854680Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3855056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3855290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3855645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3855745Z kernel = self.compile( 2025-05-07T20:32:47.3856191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3856377Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3856508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3856512Z 2025-05-07T20:32:47.3856726Z self = 2025-05-07T20:32:47.3857524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3858090Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad293ec0>} 2025-05-07T20:32:47.3858949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3859184Z context = 2025-05-07T20:32:47.3859190Z 2025-05-07T20:32:47.3859365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3859637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3859749Z module_map=module_map) 2025-05-07T20:32:47.3859917Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3860020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3860102Z E ^ 2025-05-07T20:32:47.3860468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3860476Z 2025-05-07T20:32:47.3860910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3860921Z 2025-05-07T20:32:47.3861026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3861257Z self=, 2025-05-07T20:32:47.3861339Z T=1, 2025-05-07T20:32:47.3861421Z D=7168, 2025-05-07T20:32:47.3861504Z scale_ub=None, 2025-05-07T20:32:47.3861594Z contiguous=True, 2025-05-07T20:32:47.3861680Z compiled=False, 2025-05-07T20:32:47.3861754Z ) 2025-05-07T20:32:47.3861987Z self = 2025-05-07T20:32:47.3862157Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.3862162Z 2025-05-07T20:32:47.3862240Z @given( 2025-05-07T20:32:47.3862366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3862471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3862594Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3862717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3862834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3862912Z ) 2025-05-07T20:32:47.3863166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3863262Z def test_silu_mul_quant( 2025-05-07T20:32:47.3863344Z self, 2025-05-07T20:32:47.3863422Z T: int, 2025-05-07T20:32:47.3863500Z D: int, 2025-05-07T20:32:47.3863605Z scale_ub: Optional[float], 2025-05-07T20:32:47.3863700Z contiguous: bool, 2025-05-07T20:32:47.3863790Z compiled: bool, 2025-05-07T20:32:47.3863869Z ) -> None: 2025-05-07T20:32:47.3863965Z torch.manual_seed(2025) 2025-05-07T20:32:47.3864044Z 2025-05-07T20:32:47.3864220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3864298Z 2025-05-07T20:32:47.3864441Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3864574Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3864664Z x = x_sign * x_clamp 2025-05-07T20:32:47.3864749Z x0 = x[:, :D] 2025-05-07T20:32:47.3864830Z x1 = x[:, D:] 2025-05-07T20:32:47.3864903Z 2025-05-07T20:32:47.3864991Z if contiguous: 2025-05-07T20:32:47.3865084Z x0 = x0.contiguous() 2025-05-07T20:32:47.3865174Z x1 = x1.contiguous() 2025-05-07T20:32:47.3865251Z 2025-05-07T20:32:47.3865344Z if scale_ub is not None: 2025-05-07T20:32:47.3865496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3865636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3865713Z ) 2025-05-07T20:32:47.3865792Z else: 2025-05-07T20:32:47.3865888Z scale_ub_tensor = None 2025-05-07T20:32:47.3866005Z 2025-05-07T20:32:47.3866141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3866236Z op = silu_mul_quant 2025-05-07T20:32:47.3866364Z if compiled: 2025-05-07T20:32:47.3866470Z op = torch.compile(op) 2025-05-07T20:32:47.3866579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3866652Z 2025-05-07T20:32:47.3866746Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3866751Z 2025-05-07T20:32:47.3866849Z moe/activation_test.py:117: 2025-05-07T20:32:47.3866985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3867087Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3867192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3867710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3867810Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3868187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3868423Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3868777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3868875Z kernel = self.compile( 2025-05-07T20:32:47.3869269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3869450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3869586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3869590Z 2025-05-07T20:32:47.3869801Z self = 2025-05-07T20:32:47.3870609Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3871136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acf38cc0>} 2025-05-07T20:32:47.3871903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3872104Z context = 2025-05-07T20:32:47.3872111Z 2025-05-07T20:32:47.3872282Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3872563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3872673Z module_map=module_map) 2025-05-07T20:32:47.3872886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3872995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3873074Z E ^ 2025-05-07T20:32:47.3873442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3873447Z 2025-05-07T20:32:47.3873874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3873878Z 2025-05-07T20:32:47.3873984Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3874216Z self=, 2025-05-07T20:32:47.3874336Z T=16384, 2025-05-07T20:32:47.3874416Z D=7168, 2025-05-07T20:32:47.3874506Z scale_ub=1200.0, 2025-05-07T20:32:47.3874595Z contiguous=False, 2025-05-07T20:32:47.3874683Z compiled=True, 2025-05-07T20:32:47.3874757Z ) 2025-05-07T20:32:47.3875026Z self = 2025-05-07T20:32:47.3875257Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:47.3875262Z 2025-05-07T20:32:47.3875343Z @given( 2025-05-07T20:32:47.3875465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3875571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3875689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3875810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3875929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3876007Z ) 2025-05-07T20:32:47.3876266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3876361Z def test_silu_mul_quant( 2025-05-07T20:32:47.3876438Z self, 2025-05-07T20:32:47.3876519Z T: int, 2025-05-07T20:32:47.3876597Z D: int, 2025-05-07T20:32:47.3876700Z scale_ub: Optional[float], 2025-05-07T20:32:47.3876795Z contiguous: bool, 2025-05-07T20:32:47.3876885Z compiled: bool, 2025-05-07T20:32:47.3876966Z ) -> None: 2025-05-07T20:32:47.3877067Z torch.manual_seed(2025) 2025-05-07T20:32:47.3877140Z 2025-05-07T20:32:47.3877314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3877392Z 2025-05-07T20:32:47.3877485Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3877616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3877705Z x = x_sign * x_clamp 2025-05-07T20:32:47.3877786Z x0 = x[:, :D] 2025-05-07T20:32:47.3877873Z x1 = x[:, D:] 2025-05-07T20:32:47.3877946Z 2025-05-07T20:32:47.3878031Z if contiguous: 2025-05-07T20:32:47.3878126Z x0 = x0.contiguous() 2025-05-07T20:32:47.3878216Z x1 = x1.contiguous() 2025-05-07T20:32:47.3878289Z 2025-05-07T20:32:47.3878386Z if scale_ub is not None: 2025-05-07T20:32:47.3878497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3878642Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3878722Z ) 2025-05-07T20:32:47.3878802Z else: 2025-05-07T20:32:47.3878900Z scale_ub_tensor = None 2025-05-07T20:32:47.3878975Z 2025-05-07T20:32:47.3879108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3879203Z op = silu_mul_quant 2025-05-07T20:32:47.3879288Z if compiled: 2025-05-07T20:32:47.3879390Z op = torch.compile(op) 2025-05-07T20:32:47.3879501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3879578Z 2025-05-07T20:32:47.3879670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3879674Z 2025-05-07T20:32:47.3879776Z moe/activation_test.py:117: 2025-05-07T20:32:47.3879910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3880013Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3880184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3880613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3880712Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3881221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3881320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3881693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3881966Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3882320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3882416Z kernel = self.compile( 2025-05-07T20:32:47.3882815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3883097Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3883232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3883237Z 2025-05-07T20:32:47.3883450Z self = 2025-05-07T20:32:47.3884254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3884779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acf3a0c0>} 2025-05-07T20:32:47.3885548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3889247Z context = 2025-05-07T20:32:47.3889257Z 2025-05-07T20:32:47.3889447Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3889724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3889842Z module_map=module_map) 2025-05-07T20:32:47.3890014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3890121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3890205Z E ^ 2025-05-07T20:32:47.3890576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3890580Z 2025-05-07T20:32:47.3891022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3891030Z 2025-05-07T20:32:47.3891143Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3891378Z self=, 2025-05-07T20:32:47.3891461Z T=1, 2025-05-07T20:32:47.3891539Z D=7168, 2025-05-07T20:32:47.3891624Z scale_ub=None, 2025-05-07T20:32:47.3891717Z contiguous=False, 2025-05-07T20:32:47.3891803Z compiled=False, 2025-05-07T20:32:47.3891882Z ) 2025-05-07T20:32:47.3892110Z self = 2025-05-07T20:32:47.3892285Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.3892292Z 2025-05-07T20:32:47.3892375Z @given( 2025-05-07T20:32:47.3892499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3892603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3892724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3892849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3893033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3893118Z ) 2025-05-07T20:32:47.3893376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3893475Z def test_silu_mul_quant( 2025-05-07T20:32:47.3893553Z self, 2025-05-07T20:32:47.3893632Z T: int, 2025-05-07T20:32:47.3893713Z D: int, 2025-05-07T20:32:47.3893815Z scale_ub: Optional[float], 2025-05-07T20:32:47.3893908Z contiguous: bool, 2025-05-07T20:32:47.3893999Z compiled: bool, 2025-05-07T20:32:47.3894124Z ) -> None: 2025-05-07T20:32:47.3894225Z torch.manual_seed(2025) 2025-05-07T20:32:47.3894302Z 2025-05-07T20:32:47.3894483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3894558Z 2025-05-07T20:32:47.3894656Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3894826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3894923Z x = x_sign * x_clamp 2025-05-07T20:32:47.3895047Z x0 = x[:, :D] 2025-05-07T20:32:47.3895131Z x1 = x[:, D:] 2025-05-07T20:32:47.3895209Z 2025-05-07T20:32:47.3895295Z if contiguous: 2025-05-07T20:32:47.3895390Z x0 = x0.contiguous() 2025-05-07T20:32:47.3895484Z x1 = x1.contiguous() 2025-05-07T20:32:47.3895558Z 2025-05-07T20:32:47.3895652Z if scale_ub is not None: 2025-05-07T20:32:47.3895766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3895908Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3895987Z ) 2025-05-07T20:32:47.3896068Z else: 2025-05-07T20:32:47.3896165Z scale_ub_tensor = None 2025-05-07T20:32:47.3896239Z 2025-05-07T20:32:47.3896378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3896471Z op = silu_mul_quant 2025-05-07T20:32:47.3896563Z if compiled: 2025-05-07T20:32:47.3896669Z op = torch.compile(op) 2025-05-07T20:32:47.3896782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3896859Z 2025-05-07T20:32:47.3896953Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3896957Z 2025-05-07T20:32:47.3897058Z moe/activation_test.py:117: 2025-05-07T20:32:47.3897196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3897302Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3897406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3897929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3898034Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3898411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3898652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3899010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3899112Z kernel = self.compile( 2025-05-07T20:32:47.3899508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3899694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3899827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3899831Z 2025-05-07T20:32:47.3900047Z self = 2025-05-07T20:32:47.3900858Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3901432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acf3ac00>} 2025-05-07T20:32:47.3902204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3902404Z context = 2025-05-07T20:32:47.3902408Z 2025-05-07T20:32:47.3902580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3902930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3903041Z module_map=module_map) 2025-05-07T20:32:47.3903212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3903353Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3903432Z E ^ 2025-05-07T20:32:47.3903845Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3903850Z 2025-05-07T20:32:47.3904282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3904286Z 2025-05-07T20:32:47.3904395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3904628Z self=, 2025-05-07T20:32:47.3904710Z T=2048, 2025-05-07T20:32:47.3904794Z D=7168, 2025-05-07T20:32:47.3904881Z scale_ub=None, 2025-05-07T20:32:47.3904971Z contiguous=False, 2025-05-07T20:32:47.3905058Z compiled=True, 2025-05-07T20:32:47.3905133Z ) 2025-05-07T20:32:47.3905359Z self = 2025-05-07T20:32:47.3905544Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3905552Z 2025-05-07T20:32:47.3905635Z @given( 2025-05-07T20:32:47.3905763Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3905867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3905986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3906109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3906227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3906303Z ) 2025-05-07T20:32:47.3906562Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3906662Z def test_silu_mul_quant( 2025-05-07T20:32:47.3906741Z self, 2025-05-07T20:32:47.3906824Z T: int, 2025-05-07T20:32:47.3906903Z D: int, 2025-05-07T20:32:47.3907007Z scale_ub: Optional[float], 2025-05-07T20:32:47.3907101Z contiguous: bool, 2025-05-07T20:32:47.3907189Z compiled: bool, 2025-05-07T20:32:47.3907275Z ) -> None: 2025-05-07T20:32:47.3907377Z torch.manual_seed(2025) 2025-05-07T20:32:47.3907453Z 2025-05-07T20:32:47.3907633Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3907708Z 2025-05-07T20:32:47.3907803Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3907935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3908026Z x = x_sign * x_clamp 2025-05-07T20:32:47.3908108Z x0 = x[:, :D] 2025-05-07T20:32:47.3908194Z x1 = x[:, D:] 2025-05-07T20:32:47.3908271Z 2025-05-07T20:32:47.3908356Z if contiguous: 2025-05-07T20:32:47.3908456Z x0 = x0.contiguous() 2025-05-07T20:32:47.3908549Z x1 = x1.contiguous() 2025-05-07T20:32:47.3908631Z 2025-05-07T20:32:47.3908724Z if scale_ub is not None: 2025-05-07T20:32:47.3908833Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3908977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3909057Z ) 2025-05-07T20:32:47.3909183Z else: 2025-05-07T20:32:47.3909288Z scale_ub_tensor = None 2025-05-07T20:32:47.3909363Z 2025-05-07T20:32:47.3909497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3909593Z op = silu_mul_quant 2025-05-07T20:32:47.3909680Z if compiled: 2025-05-07T20:32:47.3909783Z op = torch.compile(op) 2025-05-07T20:32:47.3909895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3909971Z 2025-05-07T20:32:47.3910067Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3910113Z 2025-05-07T20:32:47.3910214Z moe/activation_test.py:117: 2025-05-07T20:32:47.3910348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3910457Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3910559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3910940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3911121Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3911635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3911738Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3912108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3912341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3912696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3912796Z kernel = self.compile( 2025-05-07T20:32:47.3913192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3913597Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3913782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3913787Z 2025-05-07T20:32:47.3914009Z self = 2025-05-07T20:32:47.3914810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3915331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8c2c0>} 2025-05-07T20:32:47.3916103Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3916306Z context = 2025-05-07T20:32:47.3916313Z 2025-05-07T20:32:47.3916493Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3916768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3916882Z module_map=module_map) 2025-05-07T20:32:47.3917052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3917155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3917237Z E ^ 2025-05-07T20:32:47.3917605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3917612Z 2025-05-07T20:32:47.3918043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3918048Z 2025-05-07T20:32:47.3918159Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3918489Z self=, 2025-05-07T20:32:47.3918575Z T=4096, 2025-05-07T20:32:47.3918654Z D=7168, 2025-05-07T20:32:47.3918738Z scale_ub=None, 2025-05-07T20:32:47.3918830Z contiguous=False, 2025-05-07T20:32:47.3918916Z compiled=True, 2025-05-07T20:32:47.3919000Z ) 2025-05-07T20:32:47.3919273Z self = 2025-05-07T20:32:47.3919454Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3919458Z 2025-05-07T20:32:47.3919537Z @given( 2025-05-07T20:32:47.3919758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3919861Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3919984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3920172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3920291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3920433Z ) 2025-05-07T20:32:47.3920741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3920840Z def test_silu_mul_quant( 2025-05-07T20:32:47.3920922Z self, 2025-05-07T20:32:47.3921000Z T: int, 2025-05-07T20:32:47.3921077Z D: int, 2025-05-07T20:32:47.3921181Z scale_ub: Optional[float], 2025-05-07T20:32:47.3921272Z contiguous: bool, 2025-05-07T20:32:47.3921360Z compiled: bool, 2025-05-07T20:32:47.3921444Z ) -> None: 2025-05-07T20:32:47.3921541Z torch.manual_seed(2025) 2025-05-07T20:32:47.3921621Z 2025-05-07T20:32:47.3921796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3921871Z 2025-05-07T20:32:47.3921967Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3922099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3922190Z x = x_sign * x_clamp 2025-05-07T20:32:47.3922278Z x0 = x[:, :D] 2025-05-07T20:32:47.3922360Z x1 = x[:, D:] 2025-05-07T20:32:47.3922435Z 2025-05-07T20:32:47.3922527Z if contiguous: 2025-05-07T20:32:47.3922620Z x0 = x0.contiguous() 2025-05-07T20:32:47.3922710Z x1 = x1.contiguous() 2025-05-07T20:32:47.3922786Z 2025-05-07T20:32:47.3922879Z if scale_ub is not None: 2025-05-07T20:32:47.3922991Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3923130Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3923206Z ) 2025-05-07T20:32:47.3923288Z else: 2025-05-07T20:32:47.3923388Z scale_ub_tensor = None 2025-05-07T20:32:47.3923463Z 2025-05-07T20:32:47.3923598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3923690Z op = silu_mul_quant 2025-05-07T20:32:47.3923776Z if compiled: 2025-05-07T20:32:47.3923880Z op = torch.compile(op) 2025-05-07T20:32:47.3923990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3924067Z 2025-05-07T20:32:47.3924167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3924172Z 2025-05-07T20:32:47.3924271Z moe/activation_test.py:117: 2025-05-07T20:32:47.3924407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3924510Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3924611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3924993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3925088Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3925599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3925702Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3926071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3926358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3926711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3926807Z kernel = self.compile( 2025-05-07T20:32:47.3927204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3927385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3927517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3927564Z 2025-05-07T20:32:47.3927777Z self = 2025-05-07T20:32:47.3928576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3929178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8cd60>} 2025-05-07T20:32:47.3929946Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3930149Z context = 2025-05-07T20:32:47.3930153Z 2025-05-07T20:32:47.3930325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3930598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3930711Z module_map=module_map) 2025-05-07T20:32:47.3930877Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3930986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3931064Z E ^ 2025-05-07T20:32:47.3931432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3931437Z 2025-05-07T20:32:47.3931871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3931876Z 2025-05-07T20:32:47.3931982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3932216Z self=, 2025-05-07T20:32:47.3932298Z T=16384, 2025-05-07T20:32:47.3932376Z D=5120, 2025-05-07T20:32:47.3932464Z scale_ub=1200.0, 2025-05-07T20:32:47.3932551Z contiguous=False, 2025-05-07T20:32:47.3932636Z compiled=False, 2025-05-07T20:32:47.3932714Z ) 2025-05-07T20:32:47.3932939Z self = 2025-05-07T20:32:47.3933132Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3933139Z 2025-05-07T20:32:47.3933220Z @given( 2025-05-07T20:32:47.3933342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3933445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3933565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3933685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3933803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3933878Z ) 2025-05-07T20:32:47.3934132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3934238Z def test_silu_mul_quant( 2025-05-07T20:32:47.3934319Z self, 2025-05-07T20:32:47.3934396Z T: int, 2025-05-07T20:32:47.3934476Z D: int, 2025-05-07T20:32:47.3934576Z scale_ub: Optional[float], 2025-05-07T20:32:47.3934671Z contiguous: bool, 2025-05-07T20:32:47.3934762Z compiled: bool, 2025-05-07T20:32:47.3934887Z ) -> None: 2025-05-07T20:32:47.3934988Z torch.manual_seed(2025) 2025-05-07T20:32:47.3935065Z 2025-05-07T20:32:47.3935240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3935318Z 2025-05-07T20:32:47.3935411Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3935538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3935631Z x = x_sign * x_clamp 2025-05-07T20:32:47.3935712Z x0 = x[:, :D] 2025-05-07T20:32:47.3935793Z x1 = x[:, D:] 2025-05-07T20:32:47.3935913Z 2025-05-07T20:32:47.3935998Z if contiguous: 2025-05-07T20:32:47.3936091Z x0 = x0.contiguous() 2025-05-07T20:32:47.3936186Z x1 = x1.contiguous() 2025-05-07T20:32:47.3936259Z 2025-05-07T20:32:47.3936352Z if scale_ub is not None: 2025-05-07T20:32:47.3936461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3936686Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3936766Z ) 2025-05-07T20:32:47.3936883Z else: 2025-05-07T20:32:47.3936981Z scale_ub_tensor = None 2025-05-07T20:32:47.3937059Z 2025-05-07T20:32:47.3937192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3937284Z op = silu_mul_quant 2025-05-07T20:32:47.3937374Z if compiled: 2025-05-07T20:32:47.3937475Z op = torch.compile(op) 2025-05-07T20:32:47.3937583Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3937659Z 2025-05-07T20:32:47.3937753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3937757Z 2025-05-07T20:32:47.3937862Z moe/activation_test.py:117: 2025-05-07T20:32:47.3937995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3938099Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3938206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3938728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.3938832Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3939203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3939437Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3939791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3939888Z kernel = self.compile( 2025-05-07T20:32:47.3940284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3940465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3940596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3940604Z 2025-05-07T20:32:47.3940823Z self = 2025-05-07T20:32:47.3941618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3942140Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8dc60>} 2025-05-07T20:32:47.3942904Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3943106Z context = 2025-05-07T20:32:47.3943112Z 2025-05-07T20:32:47.3943285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3943603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3943719Z module_map=module_map) 2025-05-07T20:32:47.3943885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3943987Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3944068Z E ^ 2025-05-07T20:32:47.3944432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3944436Z 2025-05-07T20:32:47.3944908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3944912Z 2025-05-07T20:32:47.3945018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3945248Z self=, 2025-05-07T20:32:47.3945368Z T=16384, 2025-05-07T20:32:47.3945446Z D=5120, 2025-05-07T20:32:47.3945534Z scale_ub=1200.0, 2025-05-07T20:32:47.3945660Z contiguous=True, 2025-05-07T20:32:47.3945746Z compiled=True, 2025-05-07T20:32:47.3945820Z ) 2025-05-07T20:32:47.3946049Z self = 2025-05-07T20:32:47.3946231Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3946235Z 2025-05-07T20:32:47.3946317Z @given( 2025-05-07T20:32:47.3946440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3946542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3946668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3946789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3946906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3946983Z ) 2025-05-07T20:32:47.3947239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3947341Z def test_silu_mul_quant( 2025-05-07T20:32:47.3947423Z self, 2025-05-07T20:32:47.3947501Z T: int, 2025-05-07T20:32:47.3947583Z D: int, 2025-05-07T20:32:47.3947684Z scale_ub: Optional[float], 2025-05-07T20:32:47.3947775Z contiguous: bool, 2025-05-07T20:32:47.3947865Z compiled: bool, 2025-05-07T20:32:47.3947944Z ) -> None: 2025-05-07T20:32:47.3948041Z torch.manual_seed(2025) 2025-05-07T20:32:47.3948117Z 2025-05-07T20:32:47.3948292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3948369Z 2025-05-07T20:32:47.3948466Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3948599Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3948691Z x = x_sign * x_clamp 2025-05-07T20:32:47.3948776Z x0 = x[:, :D] 2025-05-07T20:32:47.3948857Z x1 = x[:, D:] 2025-05-07T20:32:47.3948934Z 2025-05-07T20:32:47.3949023Z if contiguous: 2025-05-07T20:32:47.3949126Z x0 = x0.contiguous() 2025-05-07T20:32:47.3949221Z x1 = x1.contiguous() 2025-05-07T20:32:47.3949295Z 2025-05-07T20:32:47.3949390Z if scale_ub is not None: 2025-05-07T20:32:47.3949497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3949636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3949715Z ) 2025-05-07T20:32:47.3949792Z else: 2025-05-07T20:32:47.3949887Z scale_ub_tensor = None 2025-05-07T20:32:47.3949964Z 2025-05-07T20:32:47.3950097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3950193Z op = silu_mul_quant 2025-05-07T20:32:47.3950283Z if compiled: 2025-05-07T20:32:47.3950384Z op = torch.compile(op) 2025-05-07T20:32:47.3950494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3950568Z 2025-05-07T20:32:47.3950662Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3950666Z 2025-05-07T20:32:47.3950837Z moe/activation_test.py:117: 2025-05-07T20:32:47.3950973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3951076Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3951181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3951560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3951659Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3952169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3952309Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3952680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3952912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3953342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3953443Z kernel = self.compile( 2025-05-07T20:32:47.3953836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3954019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3954149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3954154Z 2025-05-07T20:32:47.3954363Z self = 2025-05-07T20:32:47.3955168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3955693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8f380>} 2025-05-07T20:32:47.3956462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3956661Z context = 2025-05-07T20:32:47.3956666Z 2025-05-07T20:32:47.3956839Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3957116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3957228Z module_map=module_map) 2025-05-07T20:32:47.3957397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3957499Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3957580Z E ^ 2025-05-07T20:32:47.3957952Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.3958384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Each of the eleven examples below failed with the identical CompilationError and traceback shown above; only the sampled parameters differ:

2025-05-07T20:32:47.3958499Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:47.3972178Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:47.3985637Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.3999188Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:47.4012852Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.4030207Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:47.4043836Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:47.4056927Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.4069923Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.4083666Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.4097124Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4110137Z 2025-05-07T20:32:47.4110572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4110576Z 2025-05-07T20:32:47.4110682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4110916Z self=, 2025-05-07T20:32:47.4110997Z T=16384, 2025-05-07T20:32:47.4111081Z D=5120, 2025-05-07T20:32:47.4111172Z scale_ub=None, 2025-05-07T20:32:47.4111260Z contiguous=False, 2025-05-07T20:32:47.4111344Z compiled=False, 2025-05-07T20:32:47.4111422Z ) 2025-05-07T20:32:47.4111646Z self = 2025-05-07T20:32:47.4111829Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4111833Z 2025-05-07T20:32:47.4111914Z @given( 2025-05-07T20:32:47.4112037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4112144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4112260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4112378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4112494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4112570Z ) 2025-05-07T20:32:47.4112826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4112969Z def test_silu_mul_quant( 2025-05-07T20:32:47.4113049Z self, 2025-05-07T20:32:47.4113128Z T: int, 2025-05-07T20:32:47.4113208Z D: int, 2025-05-07T20:32:47.4113495Z scale_ub: Optional[float], 2025-05-07T20:32:47.4113633Z contiguous: bool, 2025-05-07T20:32:47.4113768Z compiled: bool, 2025-05-07T20:32:47.4113881Z ) -> None: 2025-05-07T20:32:47.4113985Z torch.manual_seed(2025) 2025-05-07T20:32:47.4114060Z 2025-05-07T20:32:47.4114234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4114399Z 2025-05-07T20:32:47.4114492Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4114620Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4116553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
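[editor's note] The allocator hint in the message above is actionable as-is. A minimal sketch of wiring it up, assuming the setting is applied before any CUDA allocation happens (exporting it in the CI job environment works equally well):

```python
# Sketch: apply the allocator setting the OOM message suggests. The variable
# must be in the environment before CUDA initializes its caching allocator,
# so set it before importing torch.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator config is in place

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # first allocation picks up the config
```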
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4116612Z 2025-05-07T20:32:47.4116736Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4116741Z 2025-05-07T20:32:47.4116847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4117080Z self=, 2025-05-07T20:32:47.4117156Z T=4096, 2025-05-07T20:32:47.4117236Z D=7168, 2025-05-07T20:32:47.4117319Z scale_ub=1200.0, 2025-05-07T20:32:47.4117407Z contiguous=True, 2025-05-07T20:32:47.4117490Z compiled=True, 2025-05-07T20:32:47.4117564Z ) 2025-05-07T20:32:47.4117795Z self = 2025-05-07T20:32:47.4117975Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4117979Z 2025-05-07T20:32:47.4118057Z @given( 2025-05-07T20:32:47.4118183Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4118284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4118401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4118522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4118637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4118716Z ) 2025-05-07T20:32:47.4118973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4119078Z def test_silu_mul_quant( 2025-05-07T20:32:47.4119173Z self, 2025-05-07T20:32:47.4119262Z T: int, 2025-05-07T20:32:47.4119356Z D: int, 2025-05-07T20:32:47.4119462Z scale_ub: Optional[float], 2025-05-07T20:32:47.4119555Z contiguous: bool, 2025-05-07T20:32:47.4119646Z compiled: bool, 2025-05-07T20:32:47.4119727Z ) -> None: 2025-05-07T20:32:47.4119822Z torch.manual_seed(2025) 2025-05-07T20:32:47.4119898Z 2025-05-07T20:32:47.4120137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4120213Z 2025-05-07T20:32:47.4120309Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4120435Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4122346Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4122361Z 2025-05-07T20:32:47.4122482Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4122487Z 2025-05-07T20:32:47.4122591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4122823Z self=, 2025-05-07T20:32:47.4122902Z T=16384, 2025-05-07T20:32:47.4122980Z D=7168, 2025-05-07T20:32:47.4123068Z scale_ub=None, 2025-05-07T20:32:47.4123155Z contiguous=False, 2025-05-07T20:32:47.4123240Z compiled=False, 2025-05-07T20:32:47.4123359Z ) 2025-05-07T20:32:47.4123585Z self = 2025-05-07T20:32:47.4123768Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4123773Z 2025-05-07T20:32:47.4123851Z @given( 2025-05-07T20:32:47.4124014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4124120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4124275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4124396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4124514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4124588Z ) 2025-05-07T20:32:47.4124840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4124937Z def test_silu_mul_quant( 2025-05-07T20:32:47.4125014Z self, 2025-05-07T20:32:47.4125093Z T: int, 2025-05-07T20:32:47.4125171Z D: int, 2025-05-07T20:32:47.4125271Z scale_ub: Optional[float], 2025-05-07T20:32:47.4125363Z contiguous: bool, 2025-05-07T20:32:47.4125450Z compiled: bool, 2025-05-07T20:32:47.4125529Z ) -> None: 2025-05-07T20:32:47.4125626Z torch.manual_seed(2025) 2025-05-07T20:32:47.4125703Z 2025-05-07T20:32:47.4125879Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4127745Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
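[editor's note] The "Tried to allocate" figures track the test's input tensor exactly: `torch.randn([T, 2 * D])` in bfloat16 needs T * 2D * 2 bytes, which for the T=16384, D=7168 example just above is 448 MiB, the amount requested. A quick sanity check:

```python
def randn_input_mib(T: int, D: int, itemsize: int = 2) -> float:
    # Size of the [T, 2*D] bfloat16 tensor the test allocates first.
    return T * 2 * D * itemsize / (1024 ** 2)

assert randn_input_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
assert randn_input_mib(4096, 7168) == 112.0   # the "112.00 MiB" examples
assert randn_input_mib(2048, 5120) == 40.0    # the "40.00 MiB" examples
```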
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4127752Z 2025-05-07T20:32:47.4127872Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4127877Z 2025-05-07T20:32:47.4127985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4128213Z self=, 2025-05-07T20:32:47.4128296Z T=2048, 2025-05-07T20:32:47.4128372Z D=7168, 2025-05-07T20:32:47.4128457Z scale_ub=1200.0, 2025-05-07T20:32:47.4128547Z contiguous=True, 2025-05-07T20:32:47.4128630Z compiled=True, 2025-05-07T20:32:47.4128703Z ) 2025-05-07T20:32:47.4128932Z self = 2025-05-07T20:32:47.4129109Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4129114Z 2025-05-07T20:32:47.4129191Z @given( 2025-05-07T20:32:47.4129315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4129415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4129539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4129658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4129774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4129851Z ) 2025-05-07T20:32:47.4130104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4130248Z def test_silu_mul_quant( 2025-05-07T20:32:47.4130330Z self, 2025-05-07T20:32:47.4130408Z T: int, 2025-05-07T20:32:47.4130484Z D: int, 2025-05-07T20:32:47.4130585Z scale_ub: Optional[float], 2025-05-07T20:32:47.4130675Z contiguous: bool, 2025-05-07T20:32:47.4130761Z compiled: bool, 2025-05-07T20:32:47.4130846Z ) -> None: 2025-05-07T20:32:47.4130942Z torch.manual_seed(2025) 2025-05-07T20:32:47.4131018Z 2025-05-07T20:32:47.4131191Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4131308Z 2025-05-07T20:32:47.4131403Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4131529Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4133386Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4133430Z 2025-05-07T20:32:47.4133552Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4133557Z 2025-05-07T20:32:47.4133661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4133895Z self=, 2025-05-07T20:32:47.4133973Z T=2048, 2025-05-07T20:32:47.4134049Z D=7168, 2025-05-07T20:32:47.4134134Z scale_ub=None, 2025-05-07T20:32:47.4134218Z contiguous=True, 2025-05-07T20:32:47.4134305Z compiled=False, 2025-05-07T20:32:47.4134377Z ) 2025-05-07T20:32:47.4134601Z self = 2025-05-07T20:32:47.4134787Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4134791Z 2025-05-07T20:32:47.4134869Z @given( 2025-05-07T20:32:47.4134989Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4135093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4135208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4135327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4135445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4135520Z ) 2025-05-07T20:32:47.4135781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4135876Z def test_silu_mul_quant( 2025-05-07T20:32:47.4135951Z self, 2025-05-07T20:32:47.4136031Z T: int, 2025-05-07T20:32:47.4136107Z D: int, 2025-05-07T20:32:47.4136207Z scale_ub: Optional[float], 2025-05-07T20:32:47.4136306Z contiguous: bool, 2025-05-07T20:32:47.4136395Z compiled: bool, 2025-05-07T20:32:47.4136475Z ) -> None: 2025-05-07T20:32:47.4136573Z torch.manual_seed(2025) 2025-05-07T20:32:47.4136646Z 2025-05-07T20:32:47.4136818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4136895Z 2025-05-07T20:32:47.4136987Z > x_sign = torch.sign(x) 2025-05-07T20:32:47.4138809Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
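[editor's note] For orientation while reading the repeated test source: a rough eager-mode sketch of the computation `silu_mul_quant` appears to perform, SiLU of the first half, multiplied by the second half, rowwise-quantized to fp8. The e4m3 maximum of 448 and the rowwise scaling are assumptions for illustration, not FBGEMM's actual kernel semantics:

```python
from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # assumed: max magnitude representable in float8_e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Plain eager reference: compute in fp32, then quantize rowwise.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)
```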
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4138818Z 2025-05-07T20:32:47.4138984Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:47.4138990Z 2025-05-07T20:32:47.4139098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4139328Z self=, 2025-05-07T20:32:47.4139404Z T=1, 2025-05-07T20:32:47.4139485Z D=7168, 2025-05-07T20:32:47.4139567Z scale_ub=1200.0, 2025-05-07T20:32:47.4139651Z contiguous=True, 2025-05-07T20:32:47.4139739Z compiled=False, 2025-05-07T20:32:47.4139812Z ) 2025-05-07T20:32:47.4140038Z self = 2025-05-07T20:32:47.4140253Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4140258Z 2025-05-07T20:32:47.4143790Z @given( 2025-05-07T20:32:47.4143935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4144039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4144257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4144420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4144542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4144621Z ) 2025-05-07T20:32:47.4144877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4144974Z def test_silu_mul_quant( 2025-05-07T20:32:47.4145057Z self, 2025-05-07T20:32:47.4145135Z T: int, 2025-05-07T20:32:47.4145216Z D: int, 2025-05-07T20:32:47.4145317Z scale_ub: Optional[float], 2025-05-07T20:32:47.4145410Z contiguous: bool, 2025-05-07T20:32:47.4145502Z compiled: bool, 2025-05-07T20:32:47.4145582Z ) -> None: 2025-05-07T20:32:47.4145679Z torch.manual_seed(2025) 2025-05-07T20:32:47.4145755Z 2025-05-07T20:32:47.4145932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4146010Z 2025-05-07T20:32:47.4146108Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4146240Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4146330Z x = x_sign * x_clamp 2025-05-07T20:32:47.4146416Z x0 = x[:, :D] 2025-05-07T20:32:47.4146499Z x1 = x[:, D:] 2025-05-07T20:32:47.4146575Z 2025-05-07T20:32:47.4146661Z if contiguous: 2025-05-07T20:32:47.4146754Z x0 = x0.contiguous() 2025-05-07T20:32:47.4146848Z x1 = x1.contiguous() 2025-05-07T20:32:47.4146922Z 2025-05-07T20:32:47.4147015Z if scale_ub is not None: 2025-05-07T20:32:47.4147130Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4147274Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4147352Z ) 2025-05-07T20:32:47.4147432Z else: 2025-05-07T20:32:47.4147530Z scale_ub_tensor = None 2025-05-07T20:32:47.4147604Z 2025-05-07T20:32:47.4147740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4147837Z op = silu_mul_quant 2025-05-07T20:32:47.4147927Z if compiled: 2025-05-07T20:32:47.4148036Z op = torch.compile(op) 2025-05-07T20:32:47.4148145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4148221Z 2025-05-07T20:32:47.4148317Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4148322Z 2025-05-07T20:32:47.4148422Z moe/activation_test.py:117: 2025-05-07T20:32:47.4148559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4148662Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4148765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4149319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4149436Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4149820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4150104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4150461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4150561Z kernel = self.compile( 2025-05-07T20:32:47.4150959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4151142Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4151276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4151321Z 2025-05-07T20:32:47.4151535Z self = 2025-05-07T20:32:47.4152345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4152946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb462a0>} 2025-05-07T20:32:47.4153721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4153920Z context = 2025-05-07T20:32:47.4153926Z 2025-05-07T20:32:47.4154097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4154376Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4154486Z module_map=module_map) 2025-05-07T20:32:47.4154657Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4154767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4154848Z E ^ 2025-05-07T20:32:47.4155220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4155225Z 2025-05-07T20:32:47.4155654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4155658Z 2025-05-07T20:32:47.4155764Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4155997Z self=, 2025-05-07T20:32:47.4156078Z T=128, 2025-05-07T20:32:47.4156162Z D=5120, 2025-05-07T20:32:47.4156245Z scale_ub=None, 2025-05-07T20:32:47.4156332Z contiguous=True, 2025-05-07T20:32:47.4156420Z compiled=False, 2025-05-07T20:32:47.4156494Z ) 2025-05-07T20:32:47.4156720Z self = 2025-05-07T20:32:47.4156905Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4156912Z 2025-05-07T20:32:47.4156990Z @given( 2025-05-07T20:32:47.4157113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4157220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4157339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4157462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4157578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4157652Z ) 2025-05-07T20:32:47.4157915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4158014Z def test_silu_mul_quant( 2025-05-07T20:32:47.4158092Z self, 2025-05-07T20:32:47.4158173Z T: int, 2025-05-07T20:32:47.4158250Z D: int, 2025-05-07T20:32:47.4158352Z scale_ub: Optional[float], 2025-05-07T20:32:47.4158450Z contiguous: bool, 2025-05-07T20:32:47.4158540Z compiled: bool, 2025-05-07T20:32:47.4158665Z ) -> None: 2025-05-07T20:32:47.4158768Z torch.manual_seed(2025) 2025-05-07T20:32:47.4158841Z 2025-05-07T20:32:47.4159019Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4159096Z 2025-05-07T20:32:47.4159189Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4159319Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4159409Z x = x_sign * x_clamp 2025-05-07T20:32:47.4159490Z x0 = x[:, :D] 2025-05-07T20:32:47.4159574Z x1 = x[:, D:] 2025-05-07T20:32:47.4159690Z 2025-05-07T20:32:47.4159776Z if contiguous: 2025-05-07T20:32:47.4159873Z x0 = x0.contiguous() 2025-05-07T20:32:47.4159964Z x1 = x1.contiguous() 2025-05-07T20:32:47.4160037Z 2025-05-07T20:32:47.4160227Z if scale_ub is not None: 2025-05-07T20:32:47.4160340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4160529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4160649Z ) 2025-05-07T20:32:47.4160727Z else: 2025-05-07T20:32:47.4160827Z scale_ub_tensor = None 2025-05-07T20:32:47.4160901Z 2025-05-07T20:32:47.4161037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4161132Z op = silu_mul_quant 2025-05-07T20:32:47.4161217Z if compiled: 2025-05-07T20:32:47.4161319Z op = torch.compile(op) 2025-05-07T20:32:47.4161429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4161504Z 2025-05-07T20:32:47.4161599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4161603Z 2025-05-07T20:32:47.4161706Z moe/activation_test.py:117: 2025-05-07T20:32:47.4161841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4161948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4162049Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4162573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4162677Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4163049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4163284Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4163641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4163738Z kernel = self.compile( 2025-05-07T20:32:47.4164141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4164323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4164453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4164462Z 2025-05-07T20:32:47.4164684Z self = 2025-05-07T20:32:47.4165489Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4166015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb471a0>} 2025-05-07T20:32:47.4166785Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4166987Z context = 2025-05-07T20:32:47.4166994Z 2025-05-07T20:32:47.4167165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4167486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4167601Z module_map=module_map) 2025-05-07T20:32:47.4167770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4167870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4167952Z E ^ 2025-05-07T20:32:47.4168321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4168326Z 2025-05-07T20:32:47.4168800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4168804Z 2025-05-07T20:32:47.4168910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4169142Z self=, 2025-05-07T20:32:47.4169264Z T=128, 2025-05-07T20:32:47.4169341Z D=7168, 2025-05-07T20:32:47.4169429Z scale_ub=None, 2025-05-07T20:32:47.4169559Z contiguous=True, 2025-05-07T20:32:47.4169647Z compiled=False, 2025-05-07T20:32:47.4169722Z ) 2025-05-07T20:32:47.4169952Z self = 2025-05-07T20:32:47.4170128Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4170133Z 2025-05-07T20:32:47.4170214Z @given( 2025-05-07T20:32:47.4170337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4170438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4170564Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4170684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4170799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4170876Z ) 2025-05-07T20:32:47.4171133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4171238Z def test_silu_mul_quant( 2025-05-07T20:32:47.4171317Z self, 2025-05-07T20:32:47.4171395Z T: int, 2025-05-07T20:32:47.4171476Z D: int, 2025-05-07T20:32:47.4171576Z scale_ub: Optional[float], 2025-05-07T20:32:47.4171673Z contiguous: bool, 2025-05-07T20:32:47.4171762Z compiled: bool, 2025-05-07T20:32:47.4171841Z ) -> None: 2025-05-07T20:32:47.4171937Z torch.manual_seed(2025) 2025-05-07T20:32:47.4172014Z 2025-05-07T20:32:47.4172192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4172269Z 2025-05-07T20:32:47.4172365Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4172493Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4172585Z x = x_sign * x_clamp 2025-05-07T20:32:47.4172666Z x0 = x[:, :D] 2025-05-07T20:32:47.4172746Z x1 = x[:, D:] 2025-05-07T20:32:47.4172826Z 2025-05-07T20:32:47.4172911Z if contiguous: 2025-05-07T20:32:47.4173006Z x0 = x0.contiguous() 2025-05-07T20:32:47.4173104Z x1 = x1.contiguous() 2025-05-07T20:32:47.4173178Z 2025-05-07T20:32:47.4173271Z if scale_ub is not None: 2025-05-07T20:32:47.4173383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4173523Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4173600Z ) 2025-05-07T20:32:47.4173680Z else: 2025-05-07T20:32:47.4173776Z scale_ub_tensor = None 2025-05-07T20:32:47.4173850Z 2025-05-07T20:32:47.4173989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4174083Z op = silu_mul_quant 2025-05-07T20:32:47.4174171Z if compiled: 2025-05-07T20:32:47.4174272Z op = torch.compile(op) 2025-05-07T20:32:47.4174380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4174456Z 2025-05-07T20:32:47.4174551Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4174556Z 2025-05-07T20:32:47.4174701Z moe/activation_test.py:117: 2025-05-07T20:32:47.4174842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4174944Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4175047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4175569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4175668Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4176043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4176340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4176697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4176837Z kernel = self.compile( 2025-05-07T20:32:47.4177274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4177466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4177603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4177607Z 2025-05-07T20:32:47.4177819Z self = 2025-05-07T20:32:47.4178632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4179185Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca58040>} 2025-05-07T20:32:47.4179987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4180186Z context = 2025-05-07T20:32:47.4180191Z 2025-05-07T20:32:47.4180361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4180638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4180747Z module_map=module_map) 2025-05-07T20:32:47.4180918Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4181021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4181100Z E ^ 2025-05-07T20:32:47.4181470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4181474Z 2025-05-07T20:32:47.4181910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4181917Z 2025-05-07T20:32:47.4182027Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4182257Z self=, 2025-05-07T20:32:47.4182335Z T=2048, 2025-05-07T20:32:47.4182415Z D=7168, 2025-05-07T20:32:47.4182500Z scale_ub=1200.0, 2025-05-07T20:32:47.4182586Z contiguous=True, 2025-05-07T20:32:47.4182673Z compiled=False, 2025-05-07T20:32:47.4182748Z ) 2025-05-07T20:32:47.4182974Z self = 2025-05-07T20:32:47.4183160Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4183165Z 2025-05-07T20:32:47.4183243Z @given( 2025-05-07T20:32:47.4183370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4183473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4183639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4183769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4183889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4183963Z ) 2025-05-07T20:32:47.4184222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4184318Z def test_silu_mul_quant( 2025-05-07T20:32:47.4184397Z self, 2025-05-07T20:32:47.4184478Z T: int, 2025-05-07T20:32:47.4184556Z D: int, 2025-05-07T20:32:47.4184658Z scale_ub: Optional[float], 2025-05-07T20:32:47.4184799Z contiguous: bool, 2025-05-07T20:32:47.4184886Z compiled: bool, 2025-05-07T20:32:47.4184971Z ) -> None: 2025-05-07T20:32:47.4185067Z torch.manual_seed(2025) 2025-05-07T20:32:47.4185140Z 2025-05-07T20:32:47.4185317Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4187242Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
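[editor's note] This CompilationError, repeated across examples, looks like an architecture mismatch rather than flakiness: Triton's fp8e4nv (float8_e4m3fn) lowering generally needs compute capability 8.9 or newer, while linux.g5.4xlarge runners carry A10G GPUs at sm_86, where Triton only offers fp8e4b15 and fp8e5. A hedged sketch of a capability-based skip the test could use (the marker name is illustrative):

```python
import pytest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv generally requires sm_89+ (Ada/Hopper); the A10G on this
    # runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support (needs sm_89+)"
)
```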
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4187249Z 2025-05-07T20:32:47.4187375Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4187381Z 2025-05-07T20:32:47.4187488Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4187719Z self=, 2025-05-07T20:32:47.4187800Z T=1, 2025-05-07T20:32:47.4187879Z D=5120, 2025-05-07T20:32:47.4187964Z scale_ub=1200.0, 2025-05-07T20:32:47.4188055Z contiguous=True, 2025-05-07T20:32:47.4188142Z compiled=False, 2025-05-07T20:32:47.4188221Z ) 2025-05-07T20:32:47.4188448Z self = 2025-05-07T20:32:47.4188621Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4188626Z 2025-05-07T20:32:47.4188705Z @given( 2025-05-07T20:32:47.4188829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4188930Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4189050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4189172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4189308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4189397Z ) 2025-05-07T20:32:47.4189676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4189775Z def test_silu_mul_quant( 2025-05-07T20:32:47.4189859Z self, 2025-05-07T20:32:47.4189937Z T: int, 2025-05-07T20:32:47.4190018Z D: int, 2025-05-07T20:32:47.4190124Z scale_ub: Optional[float], 2025-05-07T20:32:47.4190217Z contiguous: bool, 2025-05-07T20:32:47.4190307Z compiled: bool, 2025-05-07T20:32:47.4190386Z ) -> None: 2025-05-07T20:32:47.4190482Z torch.manual_seed(2025) 2025-05-07T20:32:47.4190561Z 2025-05-07T20:32:47.4190736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4190811Z 2025-05-07T20:32:47.4190910Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4191039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4191135Z x = x_sign * x_clamp 2025-05-07T20:32:47.4191217Z x0 = x[:, :D] 2025-05-07T20:32:47.4191297Z x1 = x[:, D:] 2025-05-07T20:32:47.4191374Z 2025-05-07T20:32:47.4191463Z if contiguous: 2025-05-07T20:32:47.4191556Z x0 = x0.contiguous() 2025-05-07T20:32:47.4191654Z x1 = x1.contiguous() 2025-05-07T20:32:47.4191729Z 2025-05-07T20:32:47.4191872Z if scale_ub is not None: 2025-05-07T20:32:47.4191986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4192127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4192204Z ) 2025-05-07T20:32:47.4192284Z else: 2025-05-07T20:32:47.4192379Z scale_ub_tensor = None 2025-05-07T20:32:47.4192457Z 2025-05-07T20:32:47.4192592Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4192684Z op = silu_mul_quant 2025-05-07T20:32:47.4192775Z if compiled: 2025-05-07T20:32:47.4193208Z op = torch.compile(op) 2025-05-07T20:32:47.4193318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4193393Z 2025-05-07T20:32:47.4193487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4193492Z 2025-05-07T20:32:47.4193595Z moe/activation_test.py:117: 2025-05-07T20:32:47.4193772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4193913Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4194020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4194537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4194639Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4195012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4195243Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4195605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4195701Z kernel = self.compile( 2025-05-07T20:32:47.4196097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4196287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4196419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4196423Z 2025-05-07T20:32:47.4196642Z self = 2025-05-07T20:32:47.4197443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4197967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca59580>} 2025-05-07T20:32:47.4198743Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4198950Z context = 2025-05-07T20:32:47.4198955Z 2025-05-07T20:32:47.4199130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4199404Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4199517Z module_map=module_map) 2025-05-07T20:32:47.4199683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4199784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4199868Z E ^ 2025-05-07T20:32:47.4200290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4200295Z 2025-05-07T20:32:47.4200728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4200735Z 2025-05-07T20:32:47.4200893Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4201129Z self=, 2025-05-07T20:32:47.4201211Z T=2048, 2025-05-07T20:32:47.4201288Z D=5120, 2025-05-07T20:32:47.4201372Z scale_ub=None, 2025-05-07T20:32:47.4201463Z contiguous=True, 2025-05-07T20:32:47.4201549Z compiled=False, 2025-05-07T20:32:47.4201624Z ) 2025-05-07T20:32:47.4201852Z self = 2025-05-07T20:32:47.4202033Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4202080Z 2025-05-07T20:32:47.4202161Z @given( 2025-05-07T20:32:47.4202286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4202388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4202510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4202673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4202794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4202910Z ) 2025-05-07T20:32:47.4203165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4203261Z def test_silu_mul_quant( 2025-05-07T20:32:47.4203341Z self, 2025-05-07T20:32:47.4203418Z T: int, 2025-05-07T20:32:47.4203496Z D: int, 2025-05-07T20:32:47.4203598Z scale_ub: Optional[float], 2025-05-07T20:32:47.4203690Z contiguous: bool, 2025-05-07T20:32:47.4203777Z compiled: bool, 2025-05-07T20:32:47.4203861Z ) -> None: 2025-05-07T20:32:47.4203961Z torch.manual_seed(2025) 2025-05-07T20:32:47.4204038Z 2025-05-07T20:32:47.4204215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4204299Z 2025-05-07T20:32:47.4204393Z > x_sign = torch.sign(x) 2025-05-07T20:32:47.4206232Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4206245Z 2025-05-07T20:32:47.4206366Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:47.4206373Z 2025-05-07T20:32:47.4206477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4206712Z self=, 2025-05-07T20:32:47.4206791Z T=16384, 2025-05-07T20:32:47.4206869Z D=5120, 2025-05-07T20:32:47.4206956Z scale_ub=None, 2025-05-07T20:32:47.4207045Z contiguous=True, 2025-05-07T20:32:47.4207133Z compiled=False, 2025-05-07T20:32:47.4207210Z ) 2025-05-07T20:32:47.4207437Z self = 2025-05-07T20:32:47.4207622Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4207627Z 2025-05-07T20:32:47.4207704Z @given( 2025-05-07T20:32:47.4207826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4207929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4208046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4208165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4208289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4208363Z ) 2025-05-07T20:32:47.4208620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4208718Z def test_silu_mul_quant( 2025-05-07T20:32:47.4208795Z self, 2025-05-07T20:32:47.4208877Z T: int, 2025-05-07T20:32:47.4208953Z D: int, 2025-05-07T20:32:47.4209202Z scale_ub: Optional[float], 2025-05-07T20:32:47.4209299Z contiguous: bool, 2025-05-07T20:32:47.4209386Z compiled: bool, 2025-05-07T20:32:47.4209465Z ) -> None: 2025-05-07T20:32:47.4209562Z torch.manual_seed(2025) 2025-05-07T20:32:47.4209635Z 2025-05-07T20:32:47.4209809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4211642Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
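[editor's note] The OOMs only begin once many examples have already run, and the reported free memory keeps shrinking, which points at state accumulating across Hypothesis examples; compiled-graph caches from torch.compile and cached allocator blocks are likely suspects, though that is an inference from this log, not a confirmed diagnosis. A sketch of scrubbing that state between examples:

```python
import gc
import torch

def reset_cuda_between_examples() -> None:
    torch._dynamo.reset()      # drop torch.compile graph caches
    gc.collect()               # release Python references to dead tensors
    torch.cuda.empty_cache()   # hand cached allocator blocks back to the driver
    torch.cuda.synchronize()
```

Calling this at the top of the test body (or from a function-scoped fixture) would keep each example's footprint independent of whatever Hypothesis tried before it.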
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4211727Z 2025-05-07T20:32:47.4211887Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4211896Z 2025-05-07T20:32:47.4212003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4212235Z self=, 2025-05-07T20:32:47.4212316Z T=4096, 2025-05-07T20:32:47.4212392Z D=5120, 2025-05-07T20:32:47.4212475Z scale_ub=None, 2025-05-07T20:32:47.4212563Z contiguous=True, 2025-05-07T20:32:47.4212648Z compiled=False, 2025-05-07T20:32:47.4212722Z ) 2025-05-07T20:32:47.4212951Z self = 2025-05-07T20:32:47.4213130Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4213135Z 2025-05-07T20:32:47.4213217Z @given( 2025-05-07T20:32:47.4213579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4213738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4213883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4214009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4214127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4214205Z ) 2025-05-07T20:32:47.4214459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4214555Z def test_silu_mul_quant( 2025-05-07T20:32:47.4214636Z self, 2025-05-07T20:32:47.4214714Z T: int, 2025-05-07T20:32:47.4214790Z D: int, 2025-05-07T20:32:47.4214893Z scale_ub: Optional[float], 2025-05-07T20:32:47.4214986Z contiguous: bool, 2025-05-07T20:32:47.4215079Z compiled: bool, 2025-05-07T20:32:47.4215159Z ) -> None: 2025-05-07T20:32:47.4215256Z torch.manual_seed(2025) 2025-05-07T20:32:47.4215332Z 2025-05-07T20:32:47.4215506Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4217338Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4217347Z 2025-05-07T20:32:47.4217472Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4217477Z 2025-05-07T20:32:47.4217582Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4217818Z self=, 2025-05-07T20:32:47.4217899Z T=2048, 2025-05-07T20:32:47.4217978Z D=5120, 2025-05-07T20:32:47.4218068Z scale_ub=None, 2025-05-07T20:32:47.4218155Z contiguous=False, 2025-05-07T20:32:47.4218349Z compiled=False, 2025-05-07T20:32:47.4218429Z ) 2025-05-07T20:32:47.4218655Z self = 2025-05-07T20:32:47.4218838Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4218843Z 2025-05-07T20:32:47.4218921Z @given( 2025-05-07T20:32:47.4219043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4219147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4219264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4219444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4219566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4219642Z ) 2025-05-07T20:32:47.4219901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4220054Z def test_silu_mul_quant( 2025-05-07T20:32:47.4220132Z self, 2025-05-07T20:32:47.4220216Z T: int, 2025-05-07T20:32:47.4220348Z D: int, 2025-05-07T20:32:47.4220449Z scale_ub: Optional[float], 2025-05-07T20:32:47.4220544Z contiguous: bool, 2025-05-07T20:32:47.4220631Z compiled: bool, 2025-05-07T20:32:47.4220710Z ) -> None: 2025-05-07T20:32:47.4220810Z torch.manual_seed(2025) 2025-05-07T20:32:47.4220885Z 2025-05-07T20:32:47.4221057Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4222884Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
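[editor's note] Another option is to let the property itself refuse shapes that cannot fit: on this 22 GiB card the largest sampled inputs leave little headroom once anything else is resident. A sketch using hypothesis.assume, with a deliberately rough safety factor:

```python
import torch
from hypothesis import assume

def fits_in_free_memory(T: int, D: int, safety_factor: float = 4.0) -> bool:
    # The [T, 2*D] bf16 input plus the sign/clamp temporaries; the factor
    # is a rough allowance, not a measured bound.
    needed = T * 2 * D * 2 * safety_factor
    free, _total = torch.cuda.mem_get_info()
    return needed < free

# Inside test_silu_mul_quant, before the first allocation:
#     assume(fits_in_free_memory(T, D))
```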
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4222894Z 2025-05-07T20:32:47.4223016Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4223024Z 2025-05-07T20:32:47.4223128Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4223359Z self=, 2025-05-07T20:32:47.4223441Z T=4096, 2025-05-07T20:32:47.4223521Z D=7168, 2025-05-07T20:32:47.4223605Z scale_ub=None, 2025-05-07T20:32:47.4223696Z contiguous=True, 2025-05-07T20:32:47.4223781Z compiled=True, 2025-05-07T20:32:47.4223860Z ) 2025-05-07T20:32:47.4224086Z self = 2025-05-07T20:32:47.4224262Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.4224267Z 2025-05-07T20:32:47.4224347Z @given( 2025-05-07T20:32:47.4224470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4224573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4224696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4224816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4224932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4225012Z ) 2025-05-07T20:32:47.4225265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4225362Z def test_silu_mul_quant( 2025-05-07T20:32:47.4225445Z self, 2025-05-07T20:32:47.4225522Z T: int, 2025-05-07T20:32:47.4225604Z D: int, 2025-05-07T20:32:47.4225708Z scale_ub: Optional[float], 2025-05-07T20:32:47.4225798Z contiguous: bool, 2025-05-07T20:32:47.4225888Z compiled: bool, 2025-05-07T20:32:47.4225968Z ) -> None: 2025-05-07T20:32:47.4226065Z torch.manual_seed(2025) 2025-05-07T20:32:47.4226141Z 2025-05-07T20:32:47.4226365Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4228304Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4228349Z 2025-05-07T20:32:47.4228471Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4228476Z 2025-05-07T20:32:47.4228582Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4228817Z self=, 2025-05-07T20:32:47.4228934Z T=2048, 2025-05-07T20:32:47.4229014Z D=5120, 2025-05-07T20:32:47.4229102Z scale_ub=1200.0, 2025-05-07T20:32:47.4229228Z contiguous=False, 2025-05-07T20:32:47.4229319Z compiled=False, 2025-05-07T20:32:47.4229392Z ) 2025-05-07T20:32:47.4229617Z self = 2025-05-07T20:32:47.4229803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.4229807Z 2025-05-07T20:32:47.4229884Z @given( 2025-05-07T20:32:47.4230004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4230107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4230226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4230350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4230468Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4230544Z ) 2025-05-07T20:32:47.4230800Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4230904Z def test_silu_mul_quant( 2025-05-07T20:32:47.4230987Z self, 2025-05-07T20:32:47.4231069Z T: int, 2025-05-07T20:32:47.4231147Z D: int, 2025-05-07T20:32:47.4231247Z scale_ub: Optional[float], 2025-05-07T20:32:47.4231341Z contiguous: bool, 2025-05-07T20:32:47.4231429Z compiled: bool, 2025-05-07T20:32:47.4231509Z ) -> None: 2025-05-07T20:32:47.4231609Z torch.manual_seed(2025) 2025-05-07T20:32:47.4231683Z 2025-05-07T20:32:47.4231861Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4233689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
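[editor's note] Because failures here depend on what ran before, reproducing a single parameter set outside Hypothesis separates "kernel bug" from "GPU already full". A sketch, with the import path and the (x0, x1, scale_ub_tensor) signature read off the traceback and test body above:

```python
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

def run_one(T: int = 2048, D: int = 5120) -> None:
    torch.manual_seed(2025)
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x[:, :D].contiguous(), x[:, D:].contiguous(), None)
    print(y_fp8.shape, y_scale.shape)

run_one()
```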
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4233698Z 2025-05-07T20:32:47.4233822Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4233827Z 2025-05-07T20:32:47.4233933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4234164Z self=, 2025-05-07T20:32:47.4234245Z T=4096, 2025-05-07T20:32:47.4234323Z D=7168, 2025-05-07T20:32:47.4234408Z scale_ub=1200.0, 2025-05-07T20:32:47.4234502Z contiguous=True, 2025-05-07T20:32:47.4234589Z compiled=False, 2025-05-07T20:32:47.4234664Z ) 2025-05-07T20:32:47.4234893Z self = 2025-05-07T20:32:47.4235073Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4235080Z 2025-05-07T20:32:47.4235161Z @given( 2025-05-07T20:32:47.4235331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4235436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4235556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4235675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4235791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4235868Z ) 2025-05-07T20:32:47.4236122Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4236217Z def test_silu_mul_quant( 2025-05-07T20:32:47.4236339Z self, 2025-05-07T20:32:47.4236418Z T: int, 2025-05-07T20:32:47.4236497Z D: int, 2025-05-07T20:32:47.4236598Z scale_ub: Optional[float], 2025-05-07T20:32:47.4236689Z contiguous: bool, 2025-05-07T20:32:47.4236779Z compiled: bool, 2025-05-07T20:32:47.4236858Z ) -> None: 2025-05-07T20:32:47.4236999Z torch.manual_seed(2025) 2025-05-07T20:32:47.4237075Z 2025-05-07T20:32:47.4237312Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4239151Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4239159Z 2025-05-07T20:32:47.4239279Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4239283Z 2025-05-07T20:32:47.4239388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4239625Z self=, 2025-05-07T20:32:47.4239707Z T=16384, 2025-05-07T20:32:47.4239790Z D=7168, 2025-05-07T20:32:47.4239875Z scale_ub=None, 2025-05-07T20:32:47.4239964Z contiguous=False, 2025-05-07T20:32:47.4240052Z compiled=True, 2025-05-07T20:32:47.4240247Z ) 2025-05-07T20:32:47.4240474Z self = 2025-05-07T20:32:47.4240658Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.4240663Z 2025-05-07T20:32:47.4240741Z @given( 2025-05-07T20:32:47.4240863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4240972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4241089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4241214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4241331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4241412Z ) 2025-05-07T20:32:47.4241672Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4241772Z def test_silu_mul_quant( 2025-05-07T20:32:47.4241850Z self, 2025-05-07T20:32:47.4241931Z T: int, 2025-05-07T20:32:47.4242009Z D: int, 2025-05-07T20:32:47.4242108Z scale_ub: Optional[float], 2025-05-07T20:32:47.4242205Z contiguous: bool, 2025-05-07T20:32:47.4242292Z compiled: bool, 2025-05-07T20:32:47.4242371Z ) -> None: 2025-05-07T20:32:47.4242469Z torch.manual_seed(2025) 2025-05-07T20:32:47.4242544Z 2025-05-07T20:32:47.4242722Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4244600Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4244608Z 2025-05-07T20:32:47.4244737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4244741Z 2025-05-07T20:32:47.4244845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4245075Z self=, 2025-05-07T20:32:47.4245159Z T=4096, 2025-05-07T20:32:47.4245280Z D=7168, 2025-05-07T20:32:47.4245364Z scale_ub=None, 2025-05-07T20:32:47.4245452Z contiguous=True, 2025-05-07T20:32:47.4245536Z compiled=False, 2025-05-07T20:32:47.4245611Z ) 2025-05-07T20:32:47.4245839Z self = 2025-05-07T20:32:47.4246057Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4246064Z 2025-05-07T20:32:47.4246146Z @given( 2025-05-07T20:32:47.4246305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4246407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4246526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4246646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4246761Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4246839Z ) 2025-05-07T20:32:47.4247093Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4247191Z def test_silu_mul_quant( 2025-05-07T20:32:47.4247273Z self, 2025-05-07T20:32:47.4247350Z T: int, 2025-05-07T20:32:47.4247432Z D: int, 2025-05-07T20:32:47.4247533Z scale_ub: Optional[float], 2025-05-07T20:32:47.4247625Z contiguous: bool, 2025-05-07T20:32:47.4247718Z compiled: bool, 2025-05-07T20:32:47.4247796Z ) -> None: 2025-05-07T20:32:47.4247894Z torch.manual_seed(2025) 2025-05-07T20:32:47.4247977Z 2025-05-07T20:32:47.4248152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4249982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
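Each "Trying example" block in this log is Hypothesis walking the parameter grid declared with st.sampled_from: 5 values of T x 2 of D x 2 of scale_ub x 2 of contiguous x 2 of compiled = 80 combinations, from which max_examples are drawn. A self-contained sketch of the same pattern (max_examples=16 is illustrative; the value of _MAX_SAMPLES in the real test does not appear in this log):

    from typing import Optional

    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
    def test_grid(T: int, D: int, scale_ub: Optional[float]) -> None:
        # With the 'ci' profile's derandomize=True (see the session header
        # in the retried run later in this log), the drawn examples are
        # deterministic across runs.
        assert T > 0 and D > 0

    test_grid()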
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4249990Z 2025-05-07T20:32:47.4250112Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4250116Z 2025-05-07T20:32:47.4250224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4250461Z self=, 2025-05-07T20:32:47.4250541Z T=16384, 2025-05-07T20:32:47.4250622Z D=7168, 2025-05-07T20:32:47.4250705Z scale_ub=None, 2025-05-07T20:32:47.4250790Z contiguous=True, 2025-05-07T20:32:47.4250877Z compiled=False, 2025-05-07T20:32:47.4250951Z ) 2025-05-07T20:32:47.4251176Z self = 2025-05-07T20:32:47.4251362Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4251366Z 2025-05-07T20:32:47.4251446Z @given( 2025-05-07T20:32:47.4251566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4251669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4251785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4251909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4252028Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4252104Z ) 2025-05-07T20:32:47.4252412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4252510Z def test_silu_mul_quant( 2025-05-07T20:32:47.4252588Z self, 2025-05-07T20:32:47.4252669Z T: int, 2025-05-07T20:32:47.4252746Z D: int, 2025-05-07T20:32:47.4252845Z scale_ub: Optional[float], 2025-05-07T20:32:47.4252939Z contiguous: bool, 2025-05-07T20:32:47.4253026Z compiled: bool, 2025-05-07T20:32:47.4253105Z ) -> None: 2025-05-07T20:32:47.4253205Z torch.manual_seed(2025) 2025-05-07T20:32:47.4253320Z 2025-05-07T20:32:47.4253497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4255362Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4255404Z 2025-05-07T20:32:47.4255532Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4255536Z 2025-05-07T20:32:47.4255642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4255872Z self=, 2025-05-07T20:32:47.4255956Z T=16384, 2025-05-07T20:32:47.4256035Z D=7168, 2025-05-07T20:32:47.4256121Z scale_ub=1200.0, 2025-05-07T20:32:47.4256210Z contiguous=True, 2025-05-07T20:32:47.4256296Z compiled=False, 2025-05-07T20:32:47.4256371Z ) 2025-05-07T20:32:47.4256596Z self = 2025-05-07T20:32:47.4256786Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4256791Z 2025-05-07T20:32:47.4256872Z @given( 2025-05-07T20:32:47.4256994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4257094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4257214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4257333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4257450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4257527Z ) 2025-05-07T20:32:47.4257785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4257881Z def test_silu_mul_quant( 2025-05-07T20:32:47.4257961Z self, 2025-05-07T20:32:47.4258038Z T: int, 2025-05-07T20:32:47.4258120Z D: int, 2025-05-07T20:32:47.4258221Z scale_ub: Optional[float], 2025-05-07T20:32:47.4258314Z contiguous: bool, 2025-05-07T20:32:47.4258408Z compiled: bool, 2025-05-07T20:32:47.4258489Z ) -> None: 2025-05-07T20:32:47.4258585Z torch.manual_seed(2025) 2025-05-07T20:32:47.4258663Z 2025-05-07T20:32:47.4258840Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4260669Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
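The "Tried to allocate" figures match the input shape exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), so the first allocation needs T * 2D * 2 bytes — 112.00 MiB for T=4096, D=7168 and 448.00 MiB for T=16384 above (and 56.00 MiB for T=2048 further down). A quick check:

    # bf16 tensor of shape [T, 2*D]: T * (2*D) * 2 bytes.
    D = 7168
    for T in (2048, 4096, 16384):
        mib = T * (2 * D) * 2 / 2**20
        print(f"T={T:5d}: {mib:.2f} MiB")  # 56.00, 112.00, 448.00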
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4260678Z 2025-05-07T20:32:47.4260797Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4260804Z 2025-05-07T20:32:47.4260955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4261193Z self=, 2025-05-07T20:32:47.4261273Z T=128, 2025-05-07T20:32:47.4261354Z D=5120, 2025-05-07T20:32:47.4261439Z scale_ub=1200.0, 2025-05-07T20:32:47.4261526Z contiguous=False, 2025-05-07T20:32:47.4261616Z compiled=False, 2025-05-07T20:32:47.4261690Z ) 2025-05-07T20:32:47.4261914Z self = 2025-05-07T20:32:47.4262096Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.4262141Z 2025-05-07T20:32:47.4262220Z @given( 2025-05-07T20:32:47.4262340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4262445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4262561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4262724Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4262843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4262955Z ) 2025-05-07T20:32:47.4263214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4263309Z def test_silu_mul_quant( 2025-05-07T20:32:47.4263386Z self, 2025-05-07T20:32:47.4263465Z T: int, 2025-05-07T20:32:47.4263542Z D: int, 2025-05-07T20:32:47.4263641Z scale_ub: Optional[float], 2025-05-07T20:32:47.4263734Z contiguous: bool, 2025-05-07T20:32:47.4263821Z compiled: bool, 2025-05-07T20:32:47.4263902Z ) -> None: 2025-05-07T20:32:47.4264001Z torch.manual_seed(2025) 2025-05-07T20:32:47.4264074Z 2025-05-07T20:32:47.4264252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4264327Z 2025-05-07T20:32:47.4264423Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4264560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4264653Z x = x_sign * x_clamp 2025-05-07T20:32:47.4264738Z x0 = x[:, :D] 2025-05-07T20:32:47.4264821Z x1 = x[:, D:] 2025-05-07T20:32:47.4264894Z 2025-05-07T20:32:47.4264978Z if contiguous: 2025-05-07T20:32:47.4265075Z x0 = x0.contiguous() 2025-05-07T20:32:47.4265168Z x1 = x1.contiguous() 2025-05-07T20:32:47.4265241Z 2025-05-07T20:32:47.4265337Z if scale_ub is not None: 2025-05-07T20:32:47.4265446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4265585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4265667Z ) 2025-05-07T20:32:47.4265744Z else: 2025-05-07T20:32:47.4265843Z scale_ub_tensor = None 2025-05-07T20:32:47.4265916Z 2025-05-07T20:32:47.4266049Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4266143Z op = silu_mul_quant 2025-05-07T20:32:47.4266232Z if compiled: 2025-05-07T20:32:47.4266335Z op = torch.compile(op) 2025-05-07T20:32:47.4266449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4266522Z 2025-05-07T20:32:47.4266617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4266621Z 2025-05-07T20:32:47.4266722Z moe/activation_test.py:117: 2025-05-07T20:32:47.4266854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4266961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4267067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4267585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4267690Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4271542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4271803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4272240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4272345Z kernel = self.compile( 2025-05-07T20:32:47.4272747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4272929Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4273065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4273070Z 2025-05-07T20:32:47.4273283Z self = 2025-05-07T20:32:47.4274136Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4274775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ac7b11c0>} 2025-05-07T20:32:47.4275543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4275745Z context = 2025-05-07T20:32:47.4275750Z 2025-05-07T20:32:47.4275923Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4276203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4276314Z module_map=module_map) 2025-05-07T20:32:47.4276481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4276585Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4276666Z E ^ 2025-05-07T20:32:47.4277042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4277050Z 2025-05-07T20:32:47.4277481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4277485Z 2025-05-07T20:32:47.4277592Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4277826Z self=, 2025-05-07T20:32:47.4277906Z T=2048, 2025-05-07T20:32:47.4277982Z D=7168, 2025-05-07T20:32:47.4278072Z scale_ub=None, 2025-05-07T20:32:47.4278160Z contiguous=False, 2025-05-07T20:32:47.4278245Z compiled=False, 2025-05-07T20:32:47.4278324Z ) 2025-05-07T20:32:47.4278550Z self = 2025-05-07T20:32:47.4278735Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4278742Z 2025-05-07T20:32:47.4278824Z @given( 2025-05-07T20:32:47.4278950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4279056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4279176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4279296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4279414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4279492Z ) 2025-05-07T20:32:47.4279749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4279845Z def test_silu_mul_quant( 2025-05-07T20:32:47.4279925Z self, 2025-05-07T20:32:47.4280005Z T: int, 2025-05-07T20:32:47.4280163Z D: int, 2025-05-07T20:32:47.4280266Z scale_ub: Optional[float], 2025-05-07T20:32:47.4280359Z contiguous: bool, 2025-05-07T20:32:47.4280446Z compiled: bool, 2025-05-07T20:32:47.4280529Z ) -> None: 2025-05-07T20:32:47.4280628Z torch.manual_seed(2025) 2025-05-07T20:32:47.4280750Z 2025-05-07T20:32:47.4280933Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4282775Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
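The CompilationError above is a different failure mode from the OOMs: Triton rejects the fp8e4nv (FP8 E4M3) element type while lowering the kernel. This runner's g5.4xlarge carries an A10G, a compute capability 8.6 part, and Triton only lowers fp8e4nv on sm_89 and newer (Ada/Hopper); older GPUs are limited to the fp8e4b15 and fp8e5 encodings named in the message. One possible guard, not part of the test file, that would skip rather than fail on such hardware:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (e4m3) lowering in Triton requires sm_89+ (Ada/Hopper).
        # The A10G on this runner reports (8, 6), hence the ValueError.
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. in the test:  if not fp8e4nv_supported(): pytest.skip("needs sm_89+")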
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4282822Z 2025-05-07T20:32:47.4282944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4282949Z 2025-05-07T20:32:47.4283058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4283332Z self=, 2025-05-07T20:32:47.4283450Z T=128, 2025-05-07T20:32:47.4283530Z D=7168, 2025-05-07T20:32:47.4283616Z scale_ub=1200.0, 2025-05-07T20:32:47.4283709Z contiguous=True, 2025-05-07T20:32:47.4283794Z compiled=True, 2025-05-07T20:32:47.4283869Z ) 2025-05-07T20:32:47.4284099Z self = 2025-05-07T20:32:47.4284273Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4284278Z 2025-05-07T20:32:47.4284356Z @given( 2025-05-07T20:32:47.4284480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4284585Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4284707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4284829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4284948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4285030Z ) 2025-05-07T20:32:47.4285288Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4285385Z def test_silu_mul_quant( 2025-05-07T20:32:47.4285467Z self, 2025-05-07T20:32:47.4285545Z T: int, 2025-05-07T20:32:47.4285623Z D: int, 2025-05-07T20:32:47.4285727Z scale_ub: Optional[float], 2025-05-07T20:32:47.4285818Z contiguous: bool, 2025-05-07T20:32:47.4285905Z compiled: bool, 2025-05-07T20:32:47.4285989Z ) -> None: 2025-05-07T20:32:47.4286085Z torch.manual_seed(2025) 2025-05-07T20:32:47.4286161Z 2025-05-07T20:32:47.4286339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4286414Z 2025-05-07T20:32:47.4286514Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4286642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4286734Z x = x_sign * x_clamp 2025-05-07T20:32:47.4286822Z x0 = x[:, :D] 2025-05-07T20:32:47.4286904Z x1 = x[:, D:] 2025-05-07T20:32:47.4286982Z 2025-05-07T20:32:47.4287074Z if contiguous: 2025-05-07T20:32:47.4287169Z x0 = x0.contiguous() 2025-05-07T20:32:47.4287260Z x1 = x1.contiguous() 2025-05-07T20:32:47.4287339Z 2025-05-07T20:32:47.4287431Z if scale_ub is not None: 2025-05-07T20:32:47.4287544Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4287684Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4287760Z ) 2025-05-07T20:32:47.4287840Z else: 2025-05-07T20:32:47.4287936Z scale_ub_tensor = None 2025-05-07T20:32:47.4288013Z 2025-05-07T20:32:47.4288149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4288240Z op = silu_mul_quant 2025-05-07T20:32:47.4288327Z if compiled: 2025-05-07T20:32:47.4288433Z op = torch.compile(op) 2025-05-07T20:32:47.4288545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4288618Z 2025-05-07T20:32:47.4288766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4288771Z 2025-05-07T20:32:47.4288872Z moe/activation_test.py:117: 2025-05-07T20:32:47.4289008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4289111Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4289213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4289599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.4289694Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.4290244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4290348Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4290717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4290998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4291388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4291487Z kernel = self.compile( 2025-05-07T20:32:47.4291885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4292067Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4292198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4292207Z 2025-05-07T20:32:47.4292421Z self = 2025-05-07T20:32:47.4293224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4293759Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ac8a3b00>} 2025-05-07T20:32:47.4294529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4294733Z context = 2025-05-07T20:32:47.4294738Z 2025-05-07T20:32:47.4294913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4295189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4295303Z module_map=module_map) 2025-05-07T20:32:47.4295470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4295578Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4295657Z E ^ 2025-05-07T20:32:47.4296026Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4296030Z 2025-05-07T20:32:47.4296464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4296468Z 2025-05-07T20:32:47.4296574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4296805Z self=, 2025-05-07T20:32:47.4296891Z T=128, 2025-05-07T20:32:47.4296967Z D=7168, 2025-05-07T20:32:47.4297055Z scale_ub=1200.0, 2025-05-07T20:32:47.4297141Z contiguous=True, 2025-05-07T20:32:47.4297225Z compiled=False, 2025-05-07T20:32:47.4297306Z ) 2025-05-07T20:32:47.4297532Z self = 2025-05-07T20:32:47.4297758Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4297762Z 2025-05-07T20:32:47.4297850Z @given( 2025-05-07T20:32:47.4297973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4298075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4298196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4298316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4298435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4298510Z ) 2025-05-07T20:32:47.4298767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4298917Z def test_silu_mul_quant( 2025-05-07T20:32:47.4299016Z self, 2025-05-07T20:32:47.4299098Z T: int, 2025-05-07T20:32:47.4299200Z D: int, 2025-05-07T20:32:47.4299302Z scale_ub: Optional[float], 2025-05-07T20:32:47.4299393Z contiguous: bool, 2025-05-07T20:32:47.4299525Z compiled: bool, 2025-05-07T20:32:47.4299607Z ) -> None: 2025-05-07T20:32:47.4299742Z torch.manual_seed(2025) 2025-05-07T20:32:47.4299819Z 2025-05-07T20:32:47.4299996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4300075Z 2025-05-07T20:32:47.4300169Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4300298Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4302146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
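For reference, the op under test, silu_mul_quant, fuses a SiLU-gated multiply with row-wise FP8 quantization; the test's ref_fn (visible in the retried session later in this log) spells out the unquantized part in fp32 as x0 * sigmoid(x0) * x1. The elementwise reference in plain PyTorch, as a sketch (this is not FBGEMM's Triton kernel):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # silu(x) = x * sigmoid(x); the fused kernel _fbgemm_silu_mul_quant
        # additionally quantizes each output row to fp8 with a per-row scale.
        x0, x1 = x0.float(), x1.float()
        return x0 * torch.sigmoid(x0) * x1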
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4302157Z 2025-05-07T20:32:47.4302287Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4302292Z 2025-05-07T20:32:47.4302401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4302635Z self=, 2025-05-07T20:32:47.4302713Z T=128, 2025-05-07T20:32:47.4302793Z D=5120, 2025-05-07T20:32:47.4302878Z scale_ub=1200.0, 2025-05-07T20:32:47.4302964Z contiguous=True, 2025-05-07T20:32:47.4303051Z compiled=True, 2025-05-07T20:32:47.4303123Z ) 2025-05-07T20:32:47.4303353Z self = 2025-05-07T20:32:47.4303530Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4303535Z 2025-05-07T20:32:47.4303614Z @given( 2025-05-07T20:32:47.4303738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4303841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4303965Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4304089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4304205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4304280Z ) 2025-05-07T20:32:47.4304536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4304633Z def test_silu_mul_quant( 2025-05-07T20:32:47.4304716Z self, 2025-05-07T20:32:47.4304793Z T: int, 2025-05-07T20:32:47.4304871Z D: int, 2025-05-07T20:32:47.4304976Z scale_ub: Optional[float], 2025-05-07T20:32:47.4305074Z contiguous: bool, 2025-05-07T20:32:47.4305162Z compiled: bool, 2025-05-07T20:32:47.4305244Z ) -> None: 2025-05-07T20:32:47.4305341Z torch.manual_seed(2025) 2025-05-07T20:32:47.4305414Z 2025-05-07T20:32:47.4305592Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4305669Z 2025-05-07T20:32:47.4305762Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4305945Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4307767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
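Note how the failure point drifts: the earliest examples die allocating x at line 92, while these later ones die at line 95 (torch.clamp) with only ~4 MiB free — allocations from previously failed examples are still referenced, so each new draw starts with a nearly full device. A hypothetical per-example teardown, not present in the test file, that would reset the allocator between draws:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop lingering references from a failed example, then return the
        # allocator's cached blocks to the driver for the next draw.
        gc.collect()
        torch.cuda.empty_cache()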
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4307836Z 2025-05-07T20:32:47.4307958Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4307962Z 2025-05-07T20:32:47.4308068Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4308302Z self=, 2025-05-07T20:32:47.4308421Z T=128, 2025-05-07T20:32:47.4308498Z D=7168, 2025-05-07T20:32:47.4308622Z scale_ub=None, 2025-05-07T20:32:47.4308710Z contiguous=True, 2025-05-07T20:32:47.4308794Z compiled=True, 2025-05-07T20:32:47.4308870Z ) 2025-05-07T20:32:47.4309097Z self = 2025-05-07T20:32:47.4309294Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.4309303Z 2025-05-07T20:32:47.4309389Z @given( 2025-05-07T20:32:47.4309530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4309637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4309755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4309875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4309996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4310071Z ) 2025-05-07T20:32:47.4310331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4310432Z def test_silu_mul_quant( 2025-05-07T20:32:47.4310509Z self, 2025-05-07T20:32:47.4310587Z T: int, 2025-05-07T20:32:47.4310667Z D: int, 2025-05-07T20:32:47.4310768Z scale_ub: Optional[float], 2025-05-07T20:32:47.4310862Z contiguous: bool, 2025-05-07T20:32:47.4310950Z compiled: bool, 2025-05-07T20:32:47.4311029Z ) -> None: 2025-05-07T20:32:47.4311127Z torch.manual_seed(2025) 2025-05-07T20:32:47.4311201Z 2025-05-07T20:32:47.4311375Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4313201Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4313210Z 2025-05-07T20:32:47.4313518Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4313727Z =============================== warnings summary =============================== 2025-05-07T20:32:47.4314059Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:47.4314379Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:47.4314689Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:47.4315686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:47.4315935Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:47.4315940Z 2025-05-07T20:32:47.4316158Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:47.4316335Z ================= 1 failed, 1 deselected, 3 warnings in 12.03s ================= 2025-05-07T20:32:49.0335082Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:49.0949701Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:49.0949944Z 2025-05-07T20:32:51.0969800Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:53.2572544Z ============================= test session starts ============================== 2025-05-07T20:32:53.2573816Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:53.2574888Z cachedir: .pytest_cache 2025-05-07T20:32:53.2576079Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:53.2577589Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:53.2578432Z plugins: hypothesis-6.131.14 2025-05-07T20:32:54.8127502Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:54.9101688Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:54.9102121Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:54.9102354Z 2025-05-07T20:32:57.0258293Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.0258998Z self=, 2025-05-07T20:32:57.0259426Z T=1, 2025-05-07T20:32:57.0259627Z D=5120, 2025-05-07T20:32:57.0259835Z scale_ub=None, 2025-05-07T20:32:57.0260057Z contiguous=True, 2025-05-07T20:32:57.0260294Z compiled=True, 2025-05-07T20:32:57.0260516Z ) 2025-05-07T20:32:57.0260854Z self = 2025-05-07T20:32:57.0261373Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:57.0261693Z 2025-05-07T20:32:57.0261777Z @given( 2025-05-07T20:32:57.0262025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.0262362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.0262691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.0263048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.0263403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.0263709Z ) 2025-05-07T20:32:57.0264079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.0264550Z def test_silu_mul_quant( 2025-05-07T20:32:57.0264813Z self, 2025-05-07T20:32:57.0265042Z T: int, 2025-05-07T20:32:57.0265252Z D: int, 2025-05-07T20:32:57.0265489Z scale_ub: Optional[float], 2025-05-07T20:32:57.0265780Z contiguous: bool, 2025-05-07T20:32:57.0266034Z compiled: bool, 2025-05-07T20:32:57.0266276Z ) -> None: 2025-05-07T20:32:57.0266508Z torch.manual_seed(2025) 2025-05-07T20:32:57.0266775Z 2025-05-07T20:32:57.0267067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.0267436Z 2025-05-07T20:32:57.0267645Z x_sign = torch.sign(x) 2025-05-07T20:32:57.0267954Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:57.0268564Z x = x_sign * x_clamp 2025-05-07T20:32:57.0268831Z x0 = x[:, :D] 2025-05-07T20:32:57.0269057Z x1 = x[:, D:] 2025-05-07T20:32:57.0269279Z 2025-05-07T20:32:57.0269485Z if contiguous: 2025-05-07T20:32:57.0269725Z x0 = x0.contiguous() 2025-05-07T20:32:57.0270002Z x1 = x1.contiguous() 2025-05-07T20:32:57.0270261Z 2025-05-07T20:32:57.0270462Z if scale_ub is not None: 2025-05-07T20:32:57.0270753Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.0271111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.0271522Z ) 2025-05-07T20:32:57.0271727Z else: 2025-05-07T20:32:57.0271951Z scale_ub_tensor = None 2025-05-07T20:32:57.0272213Z 2025-05-07T20:32:57.0272463Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.0272802Z op = silu_mul_quant 2025-05-07T20:32:57.0273152Z if compiled: 2025-05-07T20:32:57.0273413Z op = torch.compile(op) 2025-05-07T20:32:57.0273806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.0274106Z 2025-05-07T20:32:57.0274310Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.0274617Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.0274929Z 2025-05-07T20:32:57.0275176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.0275532Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.0275842Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.0276171Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.0276558Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.0276891Z 2025-05-07T20:32:57.0277111Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:57.0277319Z 2025-05-07T20:32:57.0277428Z moe/activation_test.py:126: 2025-05-07T20:32:57.0277752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0278116Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.0278463Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.0279306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:57.0280186Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.0280780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.0281509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.0282250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.0283024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.0283805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.0284492Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.0285139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.0285694Z fn() 2025-05-07T20:32:57.0286231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.0286856Z self.fn.run( 2025-05-07T20:32:57.0287360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.0287924Z kernel = self.compile( 2025-05-07T20:32:57.0288505Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.0289207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.0289695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0289943Z 2025-05-07T20:32:57.0290169Z self = 2025-05-07T20:32:57.0291322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.0292796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a7cae700>} 2025-05-07T20:32:57.0294263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.0295347Z context = 2025-05-07T20:32:57.0295695Z 2025-05-07T20:32:57.0295914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.0296479Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.0296980Z module_map=module_map) 2025-05-07T20:32:57.0297369Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.0297750Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.0298038Z E ^ 2025-05-07T20:32:57.0298535Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.0299018Z 2025-05-07T20:32:57.0299461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.0300013Z 2025-05-07T20:32:57.0300126Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.0300575Z self=, 2025-05-07T20:32:57.0301007Z T=2048, 2025-05-07T20:32:57.0301206Z D=5120, 2025-05-07T20:32:57.0301416Z scale_ub=1200.0, 2025-05-07T20:32:57.0301654Z contiguous=True, 2025-05-07T20:32:57.0301885Z compiled=False, 2025-05-07T20:32:57.0302104Z ) 2025-05-07T20:32:57.0302443Z self = 2025-05-07T20:32:57.0302970Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:57.0303268Z 2025-05-07T20:32:57.0303350Z @given( 2025-05-07T20:32:57.0303597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.0303932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.0304263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.0304617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.0304972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.0305275Z ) 2025-05-07T20:32:57.0305653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.0306127Z def test_silu_mul_quant( 2025-05-07T20:32:57.0306379Z self, 2025-05-07T20:32:57.0306588Z T: int, 2025-05-07T20:32:57.0306798Z D: int, 2025-05-07T20:32:57.0307026Z scale_ub: Optional[float], 2025-05-07T20:32:57.0307316Z contiguous: bool, 2025-05-07T20:32:57.0307573Z compiled: bool, 2025-05-07T20:32:57.0307810Z ) -> None: 2025-05-07T20:32:57.0308047Z torch.manual_seed(2025) 2025-05-07T20:32:57.0308311Z 2025-05-07T20:32:57.0308599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.0308973Z 2025-05-07T20:32:57.0309185Z x_sign = torch.sign(x) 2025-05-07T20:32:57.0309494Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.0309831Z x = x_sign * x_clamp 2025-05-07T20:32:57.0310090Z x0 = x[:, :D] 
2025-05-07T20:32:57.0310328Z x1 = x[:, D:] 2025-05-07T20:32:57.0310547Z 2025-05-07T20:32:57.0310795Z if contiguous: 2025-05-07T20:32:57.0311042Z x0 = x0.contiguous() 2025-05-07T20:32:57.0311312Z x1 = x1.contiguous() 2025-05-07T20:32:57.0311566Z 2025-05-07T20:32:57.0311768Z if scale_ub is not None: 2025-05-07T20:32:57.0312054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.0312410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.0312741Z ) 2025-05-07T20:32:57.0312941Z else: 2025-05-07T20:32:57.0313164Z scale_ub_tensor = None 2025-05-07T20:32:57.0313725Z 2025-05-07T20:32:57.0313967Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.0314301Z op = silu_mul_quant 2025-05-07T20:32:57.0314571Z if compiled: 2025-05-07T20:32:57.0314828Z op = torch.compile(op) 2025-05-07T20:32:57.0315144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.0315512Z 2025-05-07T20:32:57.0315719Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.0315955Z 2025-05-07T20:32:57.0316065Z moe/activation_test.py:117: 2025-05-07T20:32:57.0316383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0316740Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.0317039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.0317770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.0318501Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.0319074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.0319806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.0320600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.0321175Z kernel = self.compile( 2025-05-07T20:32:57.0321778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.0322505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.0322931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0323179Z 2025-05-07T20:32:57.0323407Z self = 2025-05-07T20:32:57.0324548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.0326003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a7b62020>} 2025-05-07T20:32:57.0327440Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.0328523Z context = 2025-05-07T20:32:57.0328832Z 2025-05-07T20:32:57.0329019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.0329575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.0330079Z module_map=module_map) 2025-05-07T20:32:57.0330471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.0330846Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.0331127Z E ^ 2025-05-07T20:32:57.0331624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.0332105Z 2025-05-07T20:32:57.0332628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6885269Z 2025-05-07T20:32:57.6885753Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6886431Z self=, 2025-05-07T20:32:57.6886948Z T=2048, 2025-05-07T20:32:57.6887146Z D=5120, 2025-05-07T20:32:57.6887349Z scale_ub=1200.0, 2025-05-07T20:32:57.6887576Z contiguous=True, 2025-05-07T20:32:57.6887807Z compiled=True, 2025-05-07T20:32:57.6888324Z ) 2025-05-07T20:32:57.6888653Z self = 2025-05-07T20:32:57.6889169Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:57.6889456Z 2025-05-07T20:32:57.6889563Z @given( 2025-05-07T20:32:57.6889801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6890261Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6890659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6891001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6891344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6891646Z ) 2025-05-07T20:32:57.6892004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6892465Z def test_silu_mul_quant( 2025-05-07T20:32:57.6892719Z self, 2025-05-07T20:32:57.6892917Z T: int, 2025-05-07T20:32:57.6893121Z D: int, 2025-05-07T20:32:57.6893349Z scale_ub: Optional[float], 2025-05-07T20:32:57.6893626Z contiguous: bool, 2025-05-07T20:32:57.6893875Z compiled: bool, 2025-05-07T20:32:57.6894126Z ) -> None: 2025-05-07T20:32:57.6894350Z torch.manual_seed(2025) 2025-05-07T20:32:57.6900938Z 2025-05-07T20:32:57.6901238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6901615Z 2025-05-07T20:32:57.6901829Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6902143Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6902478Z x = x_sign * x_clamp 2025-05-07T20:32:57.6902726Z x0 = x[:, :D] 2025-05-07T20:32:57.6902955Z x1 = x[:, D:] 2025-05-07T20:32:57.6903176Z 2025-05-07T20:32:57.6903369Z if contiguous: 2025-05-07T20:32:57.6903619Z x0 = x0.contiguous() 2025-05-07T20:32:57.6903897Z x1 = x1.contiguous() 2025-05-07T20:32:57.6904142Z 2025-05-07T20:32:57.6904347Z if scale_ub is not None: 2025-05-07T20:32:57.6904640Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6904992Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6905321Z ) 2025-05-07T20:32:57.6905531Z else: 2025-05-07T20:32:57.6905749Z scale_ub_tensor = None 2025-05-07T20:32:57.6906022Z 2025-05-07T20:32:57.6906273Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6906604Z op = silu_mul_quant 2025-05-07T20:32:57.6906871Z if compiled: 2025-05-07T20:32:57.6907134Z op = torch.compile(op) 2025-05-07T20:32:57.6907448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6907733Z 2025-05-07T20:32:57.6907938Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.6908239Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.6908545Z 2025-05-07T20:32:57.6908797Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6909158Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.6909461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.6909792Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.6910173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.6910495Z 2025-05-07T20:32:57.6910710Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:57.6911048Z 2025-05-07T20:32:57.6911169Z moe/activation_test.py:126: 2025-05-07T20:32:57.6911482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6911834Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.6912229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.6913057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:57.6914197Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.6914873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6915594Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6916318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.6917197Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.6917971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.6918645Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.6919285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.6919822Z fn() 2025-05-07T20:32:57.6920433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.6921046Z self.fn.run( 2025-05-07T20:32:57.6921528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6922088Z kernel = self.compile( 2025-05-07T20:32:57.6922660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6923356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6923769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6924018Z 2025-05-07T20:32:57.6924239Z self = 2025-05-07T20:32:57.6925382Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6926828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6c44400>} 2025-05-07T20:32:57.6928224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6929303Z context = 2025-05-07T20:32:57.6929616Z 2025-05-07T20:32:57.6929791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6930339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6930826Z module_map=module_map) 2025-05-07T20:32:57.6931213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6931595Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.6931869Z E ^ 2025-05-07T20:32:57.6932360Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6932835Z 2025-05-07T20:32:57.6933269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6933872Z 2025-05-07T20:32:57.6933994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6934429Z self=, 2025-05-07T20:32:57.6934854Z T=16384, 2025-05-07T20:32:57.6935059Z D=7168, 2025-05-07T20:32:57.6935256Z scale_ub=1200.0, 2025-05-07T20:32:57.6935489Z contiguous=False, 2025-05-07T20:32:57.6935730Z compiled=False, 2025-05-07T20:32:57.6935943Z ) 2025-05-07T20:32:57.6936282Z self = 2025-05-07T20:32:57.6936810Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:57.6937152Z 2025-05-07T20:32:57.6937240Z @given( 2025-05-07T20:32:57.6937480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6937816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6938142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6938531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6938918Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6939232Z ) 2025-05-07T20:32:57.6939596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6940059Z def test_silu_mul_quant( 2025-05-07T20:32:57.6940313Z self, 2025-05-07T20:32:57.6940518Z T: int, 2025-05-07T20:32:57.6940719Z D: int, 2025-05-07T20:32:57.6940948Z scale_ub: Optional[float], 2025-05-07T20:32:57.6941234Z contiguous: bool, 2025-05-07T20:32:57.6941477Z compiled: bool, 2025-05-07T20:32:57.6941715Z ) -> None: 2025-05-07T20:32:57.6941942Z torch.manual_seed(2025) 2025-05-07T20:32:57.6942186Z 2025-05-07T20:32:57.6942473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6942832Z 2025-05-07T20:32:57.6943029Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6943338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6943668Z x = x_sign * x_clamp 2025-05-07T20:32:57.6943915Z x0 = x[:, :D] 2025-05-07T20:32:57.6944146Z x1 = x[:, D:] 2025-05-07T20:32:57.6944366Z 2025-05-07T20:32:57.6944556Z if contiguous: 2025-05-07T20:32:57.6944805Z x0 = x0.contiguous() 2025-05-07T20:32:57.6945083Z x1 = x1.contiguous() 2025-05-07T20:32:57.6945336Z 2025-05-07T20:32:57.6945534Z if scale_ub is not None: 2025-05-07T20:32:57.6945822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6946180Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6946505Z ) 2025-05-07T20:32:57.6946714Z else: 2025-05-07T20:32:57.6946943Z scale_ub_tensor = None 2025-05-07T20:32:57.6947201Z 2025-05-07T20:32:57.6947448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6947783Z op = silu_mul_quant 2025-05-07T20:32:57.6948040Z if compiled: 2025-05-07T20:32:57.6948306Z op = torch.compile(op) 2025-05-07T20:32:57.6948618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6948901Z 2025-05-07T20:32:57.6949112Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6949283Z 2025-05-07T20:32:57.6949398Z moe/activation_test.py:117: 2025-05-07T20:32:57.6949709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6950053Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6950353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6951072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:57.6951781Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.6952344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6953111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6953808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6954360Z kernel = self.compile( 2025-05-07T20:32:57.6954929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6955624Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6956032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6956321Z 2025-05-07T20:32:57.6956536Z self = 2025-05-07T20:32:57.6957655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6959166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6a271a0>} 2025-05-07T20:32:57.6960628Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6961689Z context = 2025-05-07T20:32:57.6961998Z 2025-05-07T20:32:57.6962172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6962721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6963210Z module_map=module_map) 2025-05-07T20:32:57.6963586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6963960Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.6964233Z E ^ 2025-05-07T20:32:57.6964719Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6965192Z 2025-05-07T20:32:57.6965625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.3892432Z 2025-05-07T20:32:58.3892933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3893422Z self=, 2025-05-07T20:32:58.3893839Z T=1, 2025-05-07T20:32:58.3894057Z D=7168, 2025-05-07T20:32:58.3894260Z scale_ub=None, 2025-05-07T20:32:58.3894480Z contiguous=True, 2025-05-07T20:32:58.3894714Z compiled=True, 2025-05-07T20:32:58.3894930Z ) 2025-05-07T20:32:58.3895264Z self = 2025-05-07T20:32:58.3895763Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.3896047Z 2025-05-07T20:32:58.3896133Z @given( 2025-05-07T20:32:58.3896381Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3896702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3897023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3897369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3897704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3898001Z ) 2025-05-07T20:32:58.3898362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3898823Z def test_silu_mul_quant( 2025-05-07T20:32:58.3899066Z self, 2025-05-07T20:32:58.3899269Z T: int, 2025-05-07T20:32:58.3899474Z D: int, 2025-05-07T20:32:58.3899694Z scale_ub: Optional[float], 2025-05-07T20:32:58.3899977Z contiguous: bool, 2025-05-07T20:32:58.3900227Z compiled: bool, 2025-05-07T20:32:58.3900462Z ) -> None: 2025-05-07T20:32:58.3900973Z torch.manual_seed(2025) 2025-05-07T20:32:58.3901235Z 2025-05-07T20:32:58.3901513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3901873Z 2025-05-07T20:32:58.3902087Z x_sign = torch.sign(x) 2025-05-07T20:32:58.3902393Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.3902710Z x = x_sign * x_clamp 2025-05-07T20:32:58.3902961Z x0 = x[:, :D] 2025-05-07T20:32:58.3903191Z x1 = x[:, D:] 2025-05-07T20:32:58.3903401Z 2025-05-07T20:32:58.3903600Z if contiguous: 2025-05-07T20:32:58.3903923Z x0 = x0.contiguous() 2025-05-07T20:32:58.3904195Z x1 = x1.contiguous() 2025-05-07T20:32:58.3904441Z 2025-05-07T20:32:58.3904643Z if scale_ub is not None: 2025-05-07T20:32:58.3904929Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.3905277Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.3905682Z ) 2025-05-07T20:32:58.3905892Z else: 2025-05-07T20:32:58.3906176Z scale_ub_tensor = None 2025-05-07T20:32:58.3906438Z 2025-05-07T20:32:58.3906683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3907005Z op = silu_mul_quant 2025-05-07T20:32:58.3907265Z if compiled: 2025-05-07T20:32:58.3907547Z op = torch.compile(op) 2025-05-07T20:32:58.3907854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3908139Z 2025-05-07T20:32:58.3908333Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.3908629Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.3908933Z 2025-05-07T20:32:58.3909176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3909522Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.3909826Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.3910148Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.3910527Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3910850Z 2025-05-07T20:32:58.3911061Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.3911262Z 2025-05-07T20:32:58.3911367Z moe/activation_test.py:126: 2025-05-07T20:32:58.3911679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3912035Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.3912370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3913194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.3914259Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.3914827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.3915539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.3916253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.3917003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.3917765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.3918426Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.3919054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.3919592Z fn() 2025-05-07T20:32:58.3920217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.3920824Z self.fn.run( 2025-05-07T20:32:58.3921311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.3921950Z kernel = self.compile( 2025-05-07T20:32:58.3922511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.3923190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.3923610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3923848Z 2025-05-07T20:32:58.3924064Z self = 2025-05-07T20:32:58.3925191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.3926689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6d00860>} 2025-05-07T20:32:58.3928214Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.3929271Z context = 2025-05-07T20:32:58.3929571Z 2025-05-07T20:32:58.3929746Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.3930295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.3930785Z module_map=module_map) 2025-05-07T20:32:58.3931172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.3931540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.3931817Z E ^ 2025-05-07T20:32:58.3932298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.3932770Z 2025-05-07T20:32:58.3933204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.3933739Z 2025-05-07T20:32:58.3933846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3934277Z self=, 2025-05-07T20:32:58.3934694Z T=4096, 2025-05-07T20:32:58.3934883Z D=5120, 2025-05-07T20:32:58.3935082Z scale_ub=None, 2025-05-07T20:32:58.3935307Z contiguous=False, 2025-05-07T20:32:58.3935536Z compiled=False, 2025-05-07T20:32:58.3935752Z ) 2025-05-07T20:32:58.3936085Z self = 2025-05-07T20:32:58.3936596Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.3936886Z 2025-05-07T20:32:58.3936965Z @given( 2025-05-07T20:32:58.3937208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3937533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3937859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3938205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3938550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3938842Z ) 2025-05-07T20:32:58.3939203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3939662Z def test_silu_mul_quant( 2025-05-07T20:32:58.3939906Z self, 2025-05-07T20:32:58.3940107Z T: int, 2025-05-07T20:32:58.3940313Z D: int, 2025-05-07T20:32:58.3940536Z scale_ub: Optional[float], 2025-05-07T20:32:58.3940819Z contiguous: bool, 2025-05-07T20:32:58.3941069Z compiled: bool, 2025-05-07T20:32:58.3941295Z ) -> None: 2025-05-07T20:32:58.3941520Z torch.manual_seed(2025) 2025-05-07T20:32:58.3941777Z 2025-05-07T20:32:58.3942140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3942523Z 2025-05-07T20:32:58.3942730Z x_sign = torch.sign(x) 2025-05-07T20:32:58.3943035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.3943351Z x = x_sign * x_clamp 2025-05-07T20:32:58.3943601Z x0 = x[:, :D] 2025-05-07T20:32:58.3943825Z x1 = x[:, D:] 2025-05-07T20:32:58.3944034Z 2025-05-07T20:32:58.3944227Z if contiguous: 2025-05-07T20:32:58.3944470Z x0 = x0.contiguous() 2025-05-07T20:32:58.3944732Z x1 = x1.contiguous() 2025-05-07T20:32:58.3944983Z 2025-05-07T20:32:58.3945231Z if scale_ub is not None: 2025-05-07T20:32:58.3945510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.3945862Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.3946185Z ) 2025-05-07T20:32:58.3946378Z else: 2025-05-07T20:32:58.3946639Z scale_ub_tensor = None 2025-05-07T20:32:58.3946900Z 2025-05-07T20:32:58.3947177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3947508Z op = silu_mul_quant 2025-05-07T20:32:58.3947771Z if compiled: 2025-05-07T20:32:58.3948022Z op = torch.compile(op) 2025-05-07T20:32:58.3948333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3948618Z 2025-05-07T20:32:58.3948819Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.3948989Z 2025-05-07T20:32:58.3949091Z moe/activation_test.py:117: 2025-05-07T20:32:58.3949395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3949742Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.3950029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3950743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.3951461Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.3952027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.3952730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.3953418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.3953969Z kernel = self.compile( 2025-05-07T20:32:58.3954529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.3955210Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.3955627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3955863Z 2025-05-07T20:32:58.3956083Z self = 2025-05-07T20:32:58.3957200Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.3958629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6118cc0>} 2025-05-07T20:32:58.3960016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.3961170Z context = 2025-05-07T20:32:58.3961468Z 2025-05-07T20:32:58.3961648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.3962211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.3962727Z module_map=module_map) 2025-05-07T20:32:58.3963160Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.3963526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.3963802Z E ^ 2025-05-07T20:32:58.3964283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.3964750Z 2025-05-07T20:32:58.3965186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.1024343Z 2025-05-07T20:32:59.1024694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.1025510Z self=, 2025-05-07T20:32:59.1026049Z T=4096, 2025-05-07T20:32:59.1026329Z D=7168, 2025-05-07T20:32:59.1026594Z scale_ub=None, 2025-05-07T20:32:59.1026884Z contiguous=False, 2025-05-07T20:32:59.1027191Z compiled=False, 2025-05-07T20:32:59.1027621Z ) 2025-05-07T20:32:59.1028048Z self = 2025-05-07T20:32:59.1028581Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:59.1028873Z 2025-05-07T20:32:59.1028957Z @given( 2025-05-07T20:32:59.1029200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.1029528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.1029853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.1030200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.1030544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.1030850Z ) 2025-05-07T20:32:59.1031222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.1031694Z def test_silu_mul_quant( 2025-05-07T20:32:59.1031947Z self, 2025-05-07T20:32:59.1032159Z T: int, 2025-05-07T20:32:59.1032402Z D: int, 2025-05-07T20:32:59.1032654Z scale_ub: Optional[float], 2025-05-07T20:32:59.1032947Z contiguous: bool, 2025-05-07T20:32:59.1033205Z compiled: bool, 2025-05-07T20:32:59.1033442Z ) -> None: 2025-05-07T20:32:59.1033673Z torch.manual_seed(2025) 2025-05-07T20:32:59.1033930Z 2025-05-07T20:32:59.1034251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.1034615Z 2025-05-07T20:32:59.1034820Z x_sign = torch.sign(x) 2025-05-07T20:32:59.1035131Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.1035463Z x = x_sign * x_clamp 2025-05-07T20:32:59.1035713Z x0 = x[:, :D] 2025-05-07T20:32:59.1035950Z x1 = x[:, D:] 2025-05-07T20:32:59.1036166Z 2025-05-07T20:32:59.1036377Z if contiguous: 2025-05-07T20:32:59.1043241Z x0 = x0.contiguous() 2025-05-07T20:32:59.1043544Z x1 = x1.contiguous() 2025-05-07T20:32:59.1043814Z 2025-05-07T20:32:59.1044027Z if scale_ub is not None: 2025-05-07T20:32:59.1044315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.1044679Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.1045015Z ) 2025-05-07T20:32:59.1045220Z else: 2025-05-07T20:32:59.1045437Z scale_ub_tensor = None 2025-05-07T20:32:59.1045706Z 2025-05-07T20:32:59.1045957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.1046291Z op = silu_mul_quant 2025-05-07T20:32:59.1046561Z if compiled: 2025-05-07T20:32:59.1046825Z op = torch.compile(op) 2025-05-07T20:32:59.1047136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.1047430Z 2025-05-07T20:32:59.1047640Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.1047816Z 2025-05-07T20:32:59.1047925Z moe/activation_test.py:117: 2025-05-07T20:32:59.1048243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1048602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.1049034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.1049757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.1050488Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.1051054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.1051764Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.1052460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.1053072Z kernel = self.compile( 2025-05-07T20:32:59.1053639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.1054324Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.1054830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1055070Z 2025-05-07T20:32:59.1055297Z self = 2025-05-07T20:32:59.1056421Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.1057850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6119260>} 2025-05-07T20:32:59.1059246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.1060315Z context = 2025-05-07T20:32:59.1060617Z 2025-05-07T20:32:59.1060801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.1061339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.1061827Z module_map=module_map) 2025-05-07T20:32:59.1062215Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.1062587Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.1062856Z E ^ 2025-05-07T20:32:59.1063349Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.1063822Z 2025-05-07T20:32:59.1064262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.1064795Z 2025-05-07T20:32:59.1064909Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.1065342Z self=, 2025-05-07T20:32:59.1065767Z T=128, 2025-05-07T20:32:59.1065969Z D=7168, 2025-05-07T20:32:59.1066166Z scale_ub=None, 2025-05-07T20:32:59.1066395Z contiguous=False, 2025-05-07T20:32:59.1066633Z compiled=True, 2025-05-07T20:32:59.1066841Z ) 2025-05-07T20:32:59.1067178Z self = 2025-05-07T20:32:59.1067702Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:59.1067979Z 2025-05-07T20:32:59.1068059Z @given( 2025-05-07T20:32:59.1068308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.1068638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.1068968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.1069309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.1069655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.1069960Z ) 2025-05-07T20:32:59.1070370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.1070837Z def test_silu_mul_quant( 2025-05-07T20:32:59.1071096Z self, 2025-05-07T20:32:59.1071298Z T: int, 2025-05-07T20:32:59.1071508Z D: int, 2025-05-07T20:32:59.1071742Z scale_ub: Optional[float], 2025-05-07T20:32:59.1072027Z contiguous: bool, 2025-05-07T20:32:59.1072284Z compiled: bool, 2025-05-07T20:32:59.1072520Z ) -> None: 2025-05-07T20:32:59.1072740Z torch.manual_seed(2025) 2025-05-07T20:32:59.1073040Z 2025-05-07T20:32:59.1073330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.1073680Z 2025-05-07T20:32:59.1073889Z x_sign = torch.sign(x) 2025-05-07T20:32:59.1074195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.1074520Z x = x_sign * x_clamp 2025-05-07T20:32:59.1074809Z x0 = x[:, :D] 2025-05-07T20:32:59.1075043Z x1 = x[:, D:] 2025-05-07T20:32:59.1075263Z 2025-05-07T20:32:59.1075496Z if contiguous: 2025-05-07T20:32:59.1075746Z x0 = x0.contiguous() 2025-05-07T20:32:59.1076022Z x1 = x1.contiguous() 2025-05-07T20:32:59.1076268Z 2025-05-07T20:32:59.1076469Z if scale_ub is not None: 2025-05-07T20:32:59.1076755Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.1077104Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.1077430Z ) 2025-05-07T20:32:59.1077630Z else: 2025-05-07T20:32:59.1077846Z scale_ub_tensor = None 2025-05-07T20:32:59.1078107Z 2025-05-07T20:32:59.1078348Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.1078668Z op = silu_mul_quant 2025-05-07T20:32:59.1078930Z if compiled: 2025-05-07T20:32:59.1079191Z op = torch.compile(op) 2025-05-07T20:32:59.1079510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.1079796Z 2025-05-07T20:32:59.1080002Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.1080374Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.1080671Z 2025-05-07T20:32:59.1080922Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.1081273Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.1081574Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.1081904Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.1082287Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.1082610Z 2025-05-07T20:32:59.1082830Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.1083037Z 2025-05-07T20:32:59.1083151Z moe/activation_test.py:126: 2025-05-07T20:32:59.1083467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1083824Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.1084173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.1084995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.1085772Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.1086343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.1087055Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.1087772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.1088523Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.1089295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.1090016Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.1090651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.1091183Z fn() 2025-05-07T20:32:59.1091713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.1092343Z self.fn.run( 2025-05-07T20:32:59.1092854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.1093409Z kernel = self.compile( 2025-05-07T20:32:59.1094024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.1094707Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.1095118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1095402Z 2025-05-07T20:32:59.1095623Z self = 2025-05-07T20:32:59.1096787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.1098217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a611b420>} 2025-05-07T20:32:59.1099602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.1100665Z context = 2025-05-07T20:32:59.1100970Z 2025-05-07T20:32:59.1101144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.1101698Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.1102178Z module_map=module_map) 2025-05-07T20:32:59.1102564Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.1102941Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.1103217Z E ^ 2025-05-07T20:32:59.1103698Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.1104175Z 2025-05-07T20:32:59.1104605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3488959Z 2025-05-07T20:32:59.3489483Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3490068Z self=, 2025-05-07T20:32:59.3490627Z T=128, 2025-05-07T20:32:59.3490836Z D=7168, 2025-05-07T20:32:59.3491054Z scale_ub=None, 2025-05-07T20:32:59.3491289Z contiguous=False, 2025-05-07T20:32:59.3491534Z compiled=False, 2025-05-07T20:32:59.3491758Z ) 2025-05-07T20:32:59.3492102Z self = 2025-05-07T20:32:59.3492618Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:59.3492907Z 2025-05-07T20:32:59.3492989Z @given( 2025-05-07T20:32:59.3493233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3493559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3493888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3494242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3494591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3494886Z ) 2025-05-07T20:32:59.3495255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3495725Z def test_silu_mul_quant( 2025-05-07T20:32:59.3496281Z self, 2025-05-07T20:32:59.3496497Z T: int, 2025-05-07T20:32:59.3496707Z D: int, 2025-05-07T20:32:59.3496932Z scale_ub: Optional[float], 2025-05-07T20:32:59.3497222Z contiguous: bool, 2025-05-07T20:32:59.3497476Z compiled: bool, 2025-05-07T20:32:59.3497708Z ) -> None: 2025-05-07T20:32:59.3497934Z torch.manual_seed(2025) 2025-05-07T20:32:59.3498190Z 2025-05-07T20:32:59.3498471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3498828Z 2025-05-07T20:32:59.3499135Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3499435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3499760Z x = x_sign * x_clamp 2025-05-07T20:32:59.3500012Z x0 = x[:, :D] 2025-05-07T20:32:59.3500239Z x1 = x[:, D:] 2025-05-07T20:32:59.3500455Z 2025-05-07T20:32:59.3500738Z if contiguous: 2025-05-07T20:32:59.3500980Z x0 = x0.contiguous() 2025-05-07T20:32:59.3501342Z x1 = x1.contiguous() 2025-05-07T20:32:59.3501598Z 2025-05-07T20:32:59.3501798Z if scale_ub is not None: 2025-05-07T20:32:59.3502102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3502452Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3502777Z ) 2025-05-07T20:32:59.3502981Z else: 2025-05-07T20:32:59.3503200Z scale_ub_tensor = None 2025-05-07T20:32:59.3503465Z 2025-05-07T20:32:59.3503712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3504042Z op = silu_mul_quant 2025-05-07T20:32:59.3504306Z if compiled: 2025-05-07T20:32:59.3504569Z op = torch.compile(op) 2025-05-07T20:32:59.3504888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3505174Z 2025-05-07T20:32:59.3505378Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.3505556Z 2025-05-07T20:32:59.3505673Z moe/activation_test.py:117: 2025-05-07T20:32:59.3505985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3506336Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.3506632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3507357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.3508083Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.3508654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3509379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3510073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3510634Z kernel = self.compile( 2025-05-07T20:32:59.3511208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3511894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3512319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3512608Z 2025-05-07T20:32:59.3512836Z self = 2025-05-07T20:32:59.3514444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3515915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c60a40>} 2025-05-07T20:32:59.3517400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3518475Z context = 2025-05-07T20:32:59.3518782Z 2025-05-07T20:32:59.3518958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3519509Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3520001Z module_map=module_map) 2025-05-07T20:32:59.3520496Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3520958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.3521230Z E ^ 2025-05-07T20:32:59.3521718Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3522269Z 2025-05-07T20:32:59.3522818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3523570Z 2025-05-07T20:32:59.3523777Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3524267Z self=, 2025-05-07T20:32:59.3524695Z T=4096, 2025-05-07T20:32:59.3524895Z D=5120, 2025-05-07T20:32:59.3525096Z scale_ub=1200.0, 2025-05-07T20:32:59.3525338Z contiguous=True, 2025-05-07T20:32:59.3525576Z compiled=False, 2025-05-07T20:32:59.3525786Z ) 2025-05-07T20:32:59.3526126Z self = 2025-05-07T20:32:59.3526664Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:59.3526951Z 2025-05-07T20:32:59.3527043Z @given( 2025-05-07T20:32:59.3527282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3527618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3527956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3528301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3528653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3528959Z ) 2025-05-07T20:32:59.3529323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3529793Z def test_silu_mul_quant( 2025-05-07T20:32:59.3530051Z self, 2025-05-07T20:32:59.3530256Z T: int, 2025-05-07T20:32:59.3530474Z D: int, 2025-05-07T20:32:59.3530715Z scale_ub: Optional[float], 2025-05-07T20:32:59.3531008Z contiguous: bool, 2025-05-07T20:32:59.3531264Z compiled: bool, 2025-05-07T20:32:59.3531505Z ) -> None: 2025-05-07T20:32:59.3531735Z torch.manual_seed(2025) 2025-05-07T20:32:59.3531989Z 2025-05-07T20:32:59.3532279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3532646Z 2025-05-07T20:32:59.3532852Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3533174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3533506Z x = x_sign * x_clamp 2025-05-07T20:32:59.3533760Z x0 = x[:, :D] 2025-05-07T20:32:59.3533996Z x1 = x[:, D:] 2025-05-07T20:32:59.3534218Z 2025-05-07T20:32:59.3534412Z if contiguous: 2025-05-07T20:32:59.3534658Z x0 = x0.contiguous() 2025-05-07T20:32:59.3534937Z x1 = x1.contiguous() 2025-05-07T20:32:59.3535194Z 2025-05-07T20:32:59.3535400Z if scale_ub is not None: 2025-05-07T20:32:59.3535691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3536053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3536381Z ) 2025-05-07T20:32:59.3536588Z else: 2025-05-07T20:32:59.3536812Z scale_ub_tensor = None 2025-05-07T20:32:59.3537074Z 2025-05-07T20:32:59.3537327Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3537666Z op = silu_mul_quant 2025-05-07T20:32:59.3537980Z if compiled: 2025-05-07T20:32:59.3538251Z op = torch.compile(op) 2025-05-07T20:32:59.3538570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3538859Z 2025-05-07T20:32:59.3539072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.3539247Z 2025-05-07T20:32:59.3539358Z moe/activation_test.py:117: 2025-05-07T20:32:59.3539671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3540028Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.3540330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3541106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.3541829Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.3542426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3543289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3543991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3544558Z kernel = self.compile( 2025-05-07T20:32:59.3545138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3545836Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3546258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3546511Z 2025-05-07T20:32:59.3546730Z self = 2025-05-07T20:32:59.3547872Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3549322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c60ea0>} 2025-05-07T20:32:59.3550727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3551790Z context = 2025-05-07T20:32:59.3552098Z 2025-05-07T20:32:59.3552277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3552826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3553314Z module_map=module_map) 2025-05-07T20:32:59.3553697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3554072Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.3554349Z E ^ 2025-05-07T20:32:59.3554833Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3555313Z 2025-05-07T20:32:59.3555750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3556286Z 2025-05-07T20:32:59.3556401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3556840Z self=, 2025-05-07T20:32:59.3557260Z T=1, 2025-05-07T20:32:59.3557456Z D=5120, 2025-05-07T20:32:59.3557661Z scale_ub=None, 2025-05-07T20:32:59.3557883Z contiguous=True, 2025-05-07T20:32:59.3558118Z compiled=True, 2025-05-07T20:32:59.3558336Z ) 2025-05-07T20:32:59.3558670Z self = 2025-05-07T20:32:59.3559230Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:59.3559502Z 2025-05-07T20:32:59.3559593Z @given( 2025-05-07T20:32:59.3559831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3560284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3560609Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3560962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3561305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3561606Z ) 2025-05-07T20:32:59.3561976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3562484Z def test_silu_mul_quant( 2025-05-07T20:32:59.3562743Z self, 2025-05-07T20:32:59.3562951Z T: int, 2025-05-07T20:32:59.3563153Z D: int, 2025-05-07T20:32:59.3563384Z scale_ub: Optional[float], 2025-05-07T20:32:59.3563675Z contiguous: bool, 2025-05-07T20:32:59.3563969Z compiled: bool, 2025-05-07T20:32:59.3564206Z ) -> None: 2025-05-07T20:32:59.3564477Z torch.manual_seed(2025) 2025-05-07T20:32:59.3564729Z 2025-05-07T20:32:59.3565017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3565378Z 2025-05-07T20:32:59.3565578Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3565889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3566218Z x = x_sign * x_clamp 2025-05-07T20:32:59.3566478Z x0 = x[:, :D] 2025-05-07T20:32:59.3566703Z x1 = x[:, D:] 2025-05-07T20:32:59.3566924Z 2025-05-07T20:32:59.3567124Z if contiguous: 2025-05-07T20:32:59.3567363Z x0 = x0.contiguous() 2025-05-07T20:32:59.3567637Z x1 = x1.contiguous() 2025-05-07T20:32:59.3567894Z 2025-05-07T20:32:59.3568095Z if scale_ub is not None: 2025-05-07T20:32:59.3568384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3568739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3569064Z ) 2025-05-07T20:32:59.3569271Z else: 2025-05-07T20:32:59.3569494Z scale_ub_tensor = None 2025-05-07T20:32:59.3569753Z 2025-05-07T20:32:59.3569998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3570338Z op = silu_mul_quant 2025-05-07T20:32:59.3570602Z if compiled: 2025-05-07T20:32:59.3570863Z op = torch.compile(op) 2025-05-07T20:32:59.3571179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3571472Z 2025-05-07T20:32:59.3571669Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.3571976Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.3572290Z 2025-05-07T20:32:59.3572535Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3572889Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.3573203Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.3573534Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.3573915Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3574243Z 2025-05-07T20:32:59.3574450Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.3574660Z 2025-05-07T20:32:59.3574765Z moe/activation_test.py:126: 2025-05-07T20:32:59.3575076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3575432Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.3575776Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3576605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.3577390Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.3577955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3586111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3586867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.3587637Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.3588421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.3589104Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.3589731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.3590331Z fn() 2025-05-07T20:32:59.3590870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.3591475Z self.fn.run( 2025-05-07T20:32:59.3592023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3592625Z kernel = self.compile( 2025-05-07T20:32:59.3593198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3593876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3594301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3594544Z 2025-05-07T20:32:59.3594770Z self = 2025-05-07T20:32:59.3595898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3597324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c62c00>} 2025-05-07T20:32:59.3598721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3599789Z context = 2025-05-07T20:32:59.3600188Z 2025-05-07T20:32:59.3600368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3600910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3601401Z module_map=module_map) 2025-05-07T20:32:59.3601788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3602163Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.3602434Z E ^ 2025-05-07T20:32:59.3602918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3603386Z 2025-05-07T20:32:59.3603827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0568056Z 2025-05-07T20:33:00.0568396Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0569016Z self=, 2025-05-07T20:33:00.0569577Z T=2048, 2025-05-07T20:33:00.0569841Z D=5120, 2025-05-07T20:33:00.0570110Z scale_ub=None, 2025-05-07T20:33:00.0570400Z contiguous=True, 2025-05-07T20:33:00.0570697Z compiled=True, 2025-05-07T20:33:00.0570923Z ) 2025-05-07T20:33:00.0571259Z self = 2025-05-07T20:33:00.0571790Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.0572082Z 2025-05-07T20:33:00.0572172Z @given( 2025-05-07T20:33:00.0572425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0573078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0573419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0573776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0574123Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0574430Z ) 2025-05-07T20:33:00.0574808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0575273Z def test_silu_mul_quant( 2025-05-07T20:33:00.0575535Z self, 2025-05-07T20:33:00.0575840Z T: int, 2025-05-07T20:33:00.0576087Z D: int, 2025-05-07T20:33:00.0576323Z scale_ub: Optional[float], 2025-05-07T20:33:00.0576610Z contiguous: bool, 2025-05-07T20:33:00.0576872Z compiled: bool, 2025-05-07T20:33:00.0577125Z ) -> None: 2025-05-07T20:33:00.0577356Z torch.manual_seed(2025) 2025-05-07T20:33:00.0577702Z 2025-05-07T20:33:00.0577999Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0578436Z 2025-05-07T20:33:00.0578649Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0578963Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0579300Z x = x_sign * x_clamp 2025-05-07T20:33:00.0579550Z x0 = x[:, :D] 2025-05-07T20:33:00.0579786Z x1 = x[:, D:] 2025-05-07T20:33:00.0580007Z 2025-05-07T20:33:00.0580202Z if contiguous: 2025-05-07T20:33:00.0580451Z x0 = x0.contiguous() 2025-05-07T20:33:00.0580732Z x1 = x1.contiguous() 2025-05-07T20:33:00.0580989Z 2025-05-07T20:33:00.0581200Z if scale_ub is not None: 2025-05-07T20:33:00.0581495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0581848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0582184Z ) 2025-05-07T20:33:00.0582401Z else: 2025-05-07T20:33:00.0582631Z scale_ub_tensor = None 2025-05-07T20:33:00.0582906Z 2025-05-07T20:33:00.0583163Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0583494Z op = silu_mul_quant 2025-05-07T20:33:00.0583768Z if compiled: 2025-05-07T20:33:00.0584036Z op = torch.compile(op) 2025-05-07T20:33:00.0584347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0584646Z 2025-05-07T20:33:00.0584859Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.0585174Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.0585481Z 2025-05-07T20:33:00.0585740Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0586099Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.0586409Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.0586746Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.0587135Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0587470Z 2025-05-07T20:33:00.0587698Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.0587904Z 2025-05-07T20:33:00.0588022Z moe/activation_test.py:126: 2025-05-07T20:33:00.0588342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0588699Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.0589052Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0589890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.0590680Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.0591262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0591987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0592778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.0593541Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0594313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.0594993Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.0595635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.0596179Z fn() 2025-05-07T20:33:00.0596764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.0597380Z self.fn.run( 2025-05-07T20:33:00.0597871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0598477Z kernel = self.compile( 2025-05-07T20:33:00.0599092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0599785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0600354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0600602Z 2025-05-07T20:33:00.0600822Z self = 2025-05-07T20:33:00.0601956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0603406Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c7d1c0>} 2025-05-07T20:33:00.0604803Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0605873Z context = 2025-05-07T20:33:00.0606187Z 2025-05-07T20:33:00.0606366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0606919Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0607410Z module_map=module_map) 2025-05-07T20:33:00.0607803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0608191Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.0608473Z E ^ 2025-05-07T20:33:00.0608966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0609444Z 2025-05-07T20:33:00.0609885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0610426Z 2025-05-07T20:33:00.0610546Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0610983Z self=, 2025-05-07T20:33:00.0611410Z T=128, 2025-05-07T20:33:00.0611617Z D=5120, 2025-05-07T20:33:00.0611821Z scale_ub=None, 2025-05-07T20:33:00.0612054Z contiguous=True, 2025-05-07T20:33:00.0612297Z compiled=True, 2025-05-07T20:33:00.0612516Z ) 2025-05-07T20:33:00.0612855Z self = 2025-05-07T20:33:00.0613706Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.0614088Z 2025-05-07T20:33:00.0614185Z @given( 2025-05-07T20:33:00.0614428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0614765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0615108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0615562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0615933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0616246Z ) 2025-05-07T20:33:00.0616617Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0617088Z def test_silu_mul_quant( 2025-05-07T20:33:00.0617355Z self, 2025-05-07T20:33:00.0617568Z T: int, 2025-05-07T20:33:00.0617777Z D: int, 2025-05-07T20:33:00.0618017Z scale_ub: Optional[float], 2025-05-07T20:33:00.0618315Z contiguous: bool, 2025-05-07T20:33:00.0618644Z compiled: bool, 2025-05-07T20:33:00.0618880Z ) -> None: 2025-05-07T20:33:00.0619114Z torch.manual_seed(2025) 2025-05-07T20:33:00.0619374Z 2025-05-07T20:33:00.0619661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0620126Z 2025-05-07T20:33:00.0620337Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0620705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0621040Z x = x_sign * x_clamp 2025-05-07T20:33:00.0621298Z x0 = x[:, :D] 2025-05-07T20:33:00.0621525Z x1 = x[:, D:] 2025-05-07T20:33:00.0621754Z 2025-05-07T20:33:00.0621956Z if contiguous: 2025-05-07T20:33:00.0622200Z x0 = x0.contiguous() 2025-05-07T20:33:00.0622486Z x1 = x1.contiguous() 2025-05-07T20:33:00.0622745Z 2025-05-07T20:33:00.0622946Z if scale_ub is not None: 2025-05-07T20:33:00.0623239Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0623595Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0623925Z ) 2025-05-07T20:33:00.0624123Z else: 2025-05-07T20:33:00.0624348Z scale_ub_tensor = None 2025-05-07T20:33:00.0624613Z 2025-05-07T20:33:00.0624854Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0625192Z op = silu_mul_quant 2025-05-07T20:33:00.0625462Z if compiled: 2025-05-07T20:33:00.0625722Z op = torch.compile(op) 2025-05-07T20:33:00.0626041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0626338Z 2025-05-07T20:33:00.0626540Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.0626849Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.0627156Z 2025-05-07T20:33:00.0627428Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0627782Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.0628095Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.0628432Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.0628815Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0629142Z 2025-05-07T20:33:00.0629362Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.0629572Z 2025-05-07T20:33:00.0629686Z moe/activation_test.py:126: 2025-05-07T20:33:00.0630002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0630360Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.0630713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0631537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.0632318Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.0632897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0633622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0634346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.0635094Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0635915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.0636588Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.0637216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.0637760Z fn() 2025-05-07T20:33:00.0638292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.0638899Z self.fn.run( 2025-05-07T20:33:00.0639428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0639983Z kernel = self.compile( 2025-05-07T20:33:00.0640645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0641376Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0641838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0642086Z 2025-05-07T20:33:00.0642304Z self = 2025-05-07T20:33:00.0643441Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0644862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff78162ad40>} 2025-05-07T20:33:00.0646255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0647328Z context = 2025-05-07T20:33:00.0647633Z 2025-05-07T20:33:00.0647820Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0648373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0648860Z module_map=module_map) 2025-05-07T20:33:00.0649246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0649624Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.0649902Z E ^ 2025-05-07T20:33:00.0650393Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0650861Z 2025-05-07T20:33:00.0651304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.8504894Z 2025-05-07T20:33:00.8505768Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.8506347Z self=, 2025-05-07T20:33:00.8506779Z T=4096, 2025-05-07T20:33:00.8506985Z D=5120, 2025-05-07T20:33:00.8507189Z scale_ub=None, 2025-05-07T20:33:00.8507410Z contiguous=True, 2025-05-07T20:33:00.8507650Z compiled=True, 2025-05-07T20:33:00.8507866Z ) 2025-05-07T20:33:00.8508197Z self = 2025-05-07T20:33:00.8508717Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.8509004Z 2025-05-07T20:33:00.8509096Z @given( 2025-05-07T20:33:00.8509338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.8509660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.8509986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.8510334Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.8510681Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.8511191Z ) 2025-05-07T20:33:00.8511569Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.8512027Z def test_silu_mul_quant( 2025-05-07T20:33:00.8512282Z self, 2025-05-07T20:33:00.8512489Z T: int, 2025-05-07T20:33:00.8512691Z D: int, 2025-05-07T20:33:00.8512923Z scale_ub: Optional[float], 2025-05-07T20:33:00.8513212Z contiguous: bool, 2025-05-07T20:33:00.8513658Z compiled: bool, 2025-05-07T20:33:00.8513912Z ) -> None: 2025-05-07T20:33:00.8514141Z torch.manual_seed(2025) 2025-05-07T20:33:00.8514481Z 2025-05-07T20:33:00.8514766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.8515130Z 2025-05-07T20:33:00.8515337Z x_sign = torch.sign(x) 2025-05-07T20:33:00.8515642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.8516052Z x = x_sign * x_clamp 2025-05-07T20:33:00.8516310Z x0 = x[:, :D] 2025-05-07T20:33:00.8516603Z x1 = x[:, D:] 2025-05-07T20:33:00.8516822Z 2025-05-07T20:33:00.8517016Z if contiguous: 2025-05-07T20:33:00.8517252Z x0 = x0.contiguous() 2025-05-07T20:33:00.8517525Z x1 = x1.contiguous() 2025-05-07T20:33:00.8517776Z 2025-05-07T20:33:00.8517970Z if scale_ub is not None: 2025-05-07T20:33:00.8518262Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.8518615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.8518941Z ) 2025-05-07T20:33:00.8519142Z else: 2025-05-07T20:33:00.8519363Z scale_ub_tensor = None 2025-05-07T20:33:00.8519628Z 2025-05-07T20:33:00.8519869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8520290Z op = silu_mul_quant 2025-05-07T20:33:00.8520553Z if compiled: 2025-05-07T20:33:00.8520810Z op = torch.compile(op) 2025-05-07T20:33:00.8521124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.8521416Z 2025-05-07T20:33:00.8521614Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.8521914Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.8522220Z 2025-05-07T20:33:00.8522466Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8522816Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.8523163Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.8523494Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.8523875Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.8524199Z 2025-05-07T20:33:00.8524412Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.8524622Z 2025-05-07T20:33:00.8524731Z moe/activation_test.py:126: 2025-05-07T20:33:00.8525050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.8525407Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.8525762Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.8526591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.8527375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.8527950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.8528667Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.8529392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.8530144Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.8530913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.8531692Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.8532331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.8532866Z fn() 2025-05-07T20:33:00.8533400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.8534011Z self.fn.run( 2025-05-07T20:33:00.8534493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.8535097Z kernel = self.compile( 2025-05-07T20:33:00.8535667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.8536353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.8536764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.8537054Z 2025-05-07T20:33:00.8537310Z self = 2025-05-07T20:33:00.8538440Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.8539897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7815ea660>} 2025-05-07T20:33:00.8541291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.8542353Z context = 2025-05-07T20:33:00.8542662Z 2025-05-07T20:33:00.8542840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.8543394Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.8543930Z module_map=module_map) 2025-05-07T20:33:00.8544313Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.8544688Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.8544964Z E ^ 2025-05-07T20:33:00.8545450Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:00.8546365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:00.8547019Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same test body and traceback as the T=4096 example above: ref_fn() fails in triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] with]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:00.8593614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.8782116Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:00.8783754Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:00.8785149Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:00.8786196Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:00.8787364Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
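[Editor's note: the recompile_limit warning above is a separate issue from the fp8 error. Dynamo guards on input strides, so the contiguous inputs (row stride 5120) and the non-contiguous slices of the [T, 2*D] buffer (row stride 10240) each force a fresh compile of silu_mul_quant until the budget of 8 is exhausted. A sketch of two ways to keep the compiled path alive across both layouts, assuming the import path shown in the traceback; neither is applied in this job:]

    import torch
    import torch._dynamo
    # Import path taken from the traceback above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: raise the recompile budget (the warning shows the default, 8).
    torch._dynamo.config.recompile_limit = 32

    # Option 2: compile with dynamic shapes/strides so contiguous and
    # sliced inputs can share one graph instead of re-guarding on strides.
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)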
2025-05-07T20:33:01.2775930Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body as above; this time fn() itself fails at moe/activation_test.py:117, via torch/_dynamo/eval_frame.py -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) -> _fbgemm_silu_mul_quant[grid], with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]

[The next eight examples fail identically -- the only variation is whether fn() trips _fbgemm_silu_mul_quant[grid] or ref_fn() trips _kernel_quantize_fp8_row[grid], and the sampled parameters:]
Trying example: test_silu_mul_quant(T=1,   D=5120, scale_ub=None,   contiguous=False, compiled=True)  -- ref_fn() / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -- fn() / _fbgemm_silu_mul_quant
[each ending with]
E   triton.compiler.errors.CompilationError: at 1:0:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:01.7591466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.7633056Z 2025-05-07T20:33:01.7633502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9666365Z 2025-05-07T20:33:01.9666789Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9668082Z self=, 2025-05-07T20:33:01.9669270Z T=1, 2025-05-07T20:33:01.9669796Z D=7168, 2025-05-07T20:33:01.9670221Z scale_ub=None, 2025-05-07T20:33:01.9670669Z contiguous=False, 2025-05-07T20:33:01.9671136Z compiled=True, 2025-05-07T20:33:01.9671551Z ) 2025-05-07T20:33:01.9672213Z self = 2025-05-07T20:33:01.9673158Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.9673435Z 2025-05-07T20:33:01.9673527Z @given( 2025-05-07T20:33:01.9673772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9674103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9674422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9674774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9675129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9675711Z ) 2025-05-07T20:33:01.9676082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9676549Z def test_silu_mul_quant( 2025-05-07T20:33:01.9676814Z self, 2025-05-07T20:33:01.9677019Z T: int, 2025-05-07T20:33:01.9677234Z D: int, 2025-05-07T20:33:01.9677503Z scale_ub: Optional[float], 2025-05-07T20:33:01.9677792Z contiguous: bool, 2025-05-07T20:33:01.9678051Z compiled: bool, 2025-05-07T20:33:01.9678290Z ) -> None: 2025-05-07T20:33:01.9678524Z torch.manual_seed(2025) 2025-05-07T20:33:01.9678867Z 2025-05-07T20:33:01.9679154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9679519Z 2025-05-07T20:33:01.9679732Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9680035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9680566Z x = x_sign * x_clamp 2025-05-07T20:33:01.9680827Z x0 = x[:, :D] 2025-05-07T20:33:01.9681128Z x1 = x[:, D:] 2025-05-07T20:33:01.9681359Z 2025-05-07T20:33:01.9681563Z if contiguous: 2025-05-07T20:33:01.9681802Z x0 = x0.contiguous() 2025-05-07T20:33:01.9682087Z x1 = x1.contiguous() 2025-05-07T20:33:01.9682344Z 2025-05-07T20:33:01.9682544Z if scale_ub is not None: 2025-05-07T20:33:01.9682842Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9683204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9683532Z ) 2025-05-07T20:33:01.9683735Z else: 2025-05-07T20:33:01.9683963Z scale_ub_tensor = None 2025-05-07T20:33:01.9684229Z 2025-05-07T20:33:01.9684474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9684811Z op = silu_mul_quant 2025-05-07T20:33:01.9685079Z if compiled: 2025-05-07T20:33:01.9685343Z op = torch.compile(op) 2025-05-07T20:33:01.9685661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9685958Z 2025-05-07T20:33:01.9686159Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.9686466Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.9686779Z 2025-05-07T20:33:01.9687024Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9687380Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.9687692Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.9688028Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.9688407Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.9688746Z 2025-05-07T20:33:01.9688963Z > y_fp8_ref, 
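The ValueError comes from Triton's CUDA backend: fp8e4nv is Triton's name for the FP8 E4M3 format, which NVIDIA GPUs expose natively only from compute capability 8.9 (Ada) and 9.0 (Hopper) onward; on older parts the backend offers only fp8e4b15 and fp8e5, exactly the two dtypes the message lists. The failure is therefore a property of the GPU architecture, not of any particular (T, D, scale_ub) draw, which is why every remaining Hypothesis example dies at the same kernel. A minimal sketch that reproduces the error independently of FBGEMM (kernel and tensor names here are illustrative, not from the log):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs)
        # On a GPU without hardware FP8 E4M3, compiling this kernel raises the
        # same CompilationError as above (pointing at the kernel's `def` line).
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, BLOCK=128)

On SM 8.9+ the same launch compiles and completes; on the architecture this job ran on it fails at compile time, before any data is touched.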
2025-05-07T20:33:01.9666365Z 
2025-05-07T20:33:01.9666789Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> here fn() returned and the failure moved to the reference path: ref_fn() (moe/activation_test.py:126 -> triton_quantize_fp8_row, fp8_gemm.py:2370) hit the identical error while the autotuner benchmarked _kernel_quantize_fp8_row (autotuner.py:186 -> jit.py:623 -> compiler.py:273 -> make_ir):
    E   triton.compiler.errors.CompilationError: at 1:0:
    E   def _kernel_quantize_fp8_row(
    E   ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
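That retry shows the reference path is equally blocked: triton_quantize_fp8_row compiles its own fp8e4nv kernel, so it cannot serve as an oracle on this hardware. A pure-PyTorch row-wise reference would avoid Triton entirely, since torch converts to float8_e4m3fn in software on any device. A sketch under that assumption, with the scale semantics inferred from the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None] (FBGEMM's actual kernel may differ in details such as eps handling):

    from typing import Optional, Tuple
    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max in fp32, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
        y_scale = row_max / FP8_MAX
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(FP8_DTYPE)
        return y_fp8, y_scale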
2025-05-07T20:33:01.9711620Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() failed at moe/activation_test.py:117 with the same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:02.1142463Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> same failure; with compiled=False the traceback is identical minus the torch/_dynamo/eval_frame.py frame
2025-05-07T20:33:02.1180891Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same failure
2025-05-07T20:33:02.1222609Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same failure
Every example compiles the same kernel before doing any numeric work, so a capability check, sketched below, would skip them all up front.
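A sketch of such a guard; the class name is an assumption (the test's angle-bracketed repr was stripped from this log), and the real suite might equally use a pytest.mark.skipif at module level:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv is hardware FP8 E4M3, available from compute
        # capability 8.9 (Ada) / 9.0 (Hopper) onward.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class SiluMulQuantTests(unittest.TestCase):  # illustrative name
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # body as reproduced above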
2025-05-07T20:33:02.3103614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.3104342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3104913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3105628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3106330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3106891Z kernel = self.compile( 2025-05-07T20:33:02.3107461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3108156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3108587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3108832Z 2025-05-07T20:33:02.3109064Z self = 2025-05-07T20:33:02.3110187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3111648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780038720>} 2025-05-07T20:33:02.3113057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3114534Z context = 2025-05-07T20:33:02.3114838Z 2025-05-07T20:33:02.3115024Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3115657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3116158Z module_map=module_map) 2025-05-07T20:33:02.3116550Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3116925Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3117246Z E ^ 2025-05-07T20:33:02.3117736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3118218Z 2025-05-07T20:33:02.3118655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3119263Z 2025-05-07T20:33:02.3119375Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3119820Z self=, 2025-05-07T20:33:02.3120337Z T=1, 2025-05-07T20:33:02.3120607Z D=5120, 2025-05-07T20:33:02.3120818Z scale_ub=None, 2025-05-07T20:33:02.3121051Z contiguous=False, 2025-05-07T20:33:02.3121355Z compiled=False, 2025-05-07T20:33:02.3121578Z ) 2025-05-07T20:33:02.3121914Z self = 2025-05-07T20:33:02.3122436Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.3122712Z 2025-05-07T20:33:02.3122801Z @given( 2025-05-07T20:33:02.3123046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3123383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3123712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3124074Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3124422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3124730Z ) 2025-05-07T20:33:02.3125107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3125573Z def test_silu_mul_quant( 2025-05-07T20:33:02.3125844Z self, 2025-05-07T20:33:02.3126056Z T: int, 2025-05-07T20:33:02.3126264Z D: int, 2025-05-07T20:33:02.3126500Z scale_ub: Optional[float], 2025-05-07T20:33:02.3126794Z contiguous: bool, 2025-05-07T20:33:02.3127045Z compiled: bool, 2025-05-07T20:33:02.3127287Z ) -> None: 2025-05-07T20:33:02.3127520Z torch.manual_seed(2025) 2025-05-07T20:33:02.3127780Z 2025-05-07T20:33:02.3128067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3128431Z 2025-05-07T20:33:02.3128640Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3128950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3129282Z x = x_sign * x_clamp 2025-05-07T20:33:02.3129545Z x0 = x[:, :D] 2025-05-07T20:33:02.3129778Z x1 = x[:, D:] 2025-05-07T20:33:02.3130000Z 2025-05-07T20:33:02.3130199Z if contiguous: 2025-05-07T20:33:02.3130444Z x0 = x0.contiguous() 2025-05-07T20:33:02.3130723Z x1 = x1.contiguous() 2025-05-07T20:33:02.3130983Z 2025-05-07T20:33:02.3131182Z if scale_ub is not None: 2025-05-07T20:33:02.3131474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3131834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3132155Z ) 2025-05-07T20:33:02.3132362Z else: 2025-05-07T20:33:02.3132589Z scale_ub_tensor = None 2025-05-07T20:33:02.3132859Z 2025-05-07T20:33:02.3133102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3133490Z op = silu_mul_quant 2025-05-07T20:33:02.3133759Z if compiled: 2025-05-07T20:33:02.3134019Z op = torch.compile(op) 2025-05-07T20:33:02.3134334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3134641Z 2025-05-07T20:33:02.3134848Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.3135024Z 2025-05-07T20:33:02.3135131Z moe/activation_test.py:117: 2025-05-07T20:33:02.3135503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3135862Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.3136167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3136887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.3137609Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3138184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3138944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3139643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3140204Z kernel = self.compile( 2025-05-07T20:33:02.3140856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3141581Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3142008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3142251Z 2025-05-07T20:33:02.3142475Z self = 2025-05-07T20:33:02.3143604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3145030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780039120>} 2025-05-07T20:33:02.3146431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3147495Z context = 2025-05-07T20:33:02.3147798Z 2025-05-07T20:33:02.3147982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3148534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3149027Z module_map=module_map) 2025-05-07T20:33:02.3149415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3149792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3150063Z E ^ 2025-05-07T20:33:02.3150553Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3151022Z 2025-05-07T20:33:02.3151464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3152003Z 2025-05-07T20:33:02.3152118Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3152562Z self=, 2025-05-07T20:33:02.3152988Z T=4096, 2025-05-07T20:33:02.3153207Z D=7168, 2025-05-07T20:33:02.3153439Z scale_ub=1200.0, 2025-05-07T20:33:02.3153684Z contiguous=False, 2025-05-07T20:33:02.3153928Z compiled=False, 2025-05-07T20:33:02.3154140Z ) 2025-05-07T20:33:02.3154479Z self = 2025-05-07T20:33:02.3155011Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.3155300Z 2025-05-07T20:33:02.3155385Z @given( 2025-05-07T20:33:02.3155633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3155967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3156291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3156709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3157073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3157380Z ) 2025-05-07T20:33:02.3157747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3158221Z def test_silu_mul_quant( 2025-05-07T20:33:02.3158481Z self, 2025-05-07T20:33:02.3158686Z T: int, 2025-05-07T20:33:02.3158903Z D: int, 2025-05-07T20:33:02.3159139Z scale_ub: Optional[float], 2025-05-07T20:33:02.3159428Z contiguous: bool, 2025-05-07T20:33:02.3159735Z compiled: bool, 2025-05-07T20:33:02.3159974Z ) -> None: 2025-05-07T20:33:02.3160337Z torch.manual_seed(2025) 2025-05-07T20:33:02.3160599Z 2025-05-07T20:33:02.3160894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3161256Z 2025-05-07T20:33:02.3161512Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3161828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3162205Z x = x_sign * x_clamp 2025-05-07T20:33:02.3162460Z x0 = x[:, :D] 2025-05-07T20:33:02.3162693Z x1 = x[:, D:] 2025-05-07T20:33:02.3162917Z 2025-05-07T20:33:02.3163112Z if contiguous: 2025-05-07T20:33:02.3163359Z x0 = x0.contiguous() 2025-05-07T20:33:02.3163636Z x1 = x1.contiguous() 2025-05-07T20:33:02.3163885Z 2025-05-07T20:33:02.3164090Z if scale_ub is not None: 2025-05-07T20:33:02.3164385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3164741Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3165075Z ) 2025-05-07T20:33:02.3165284Z else: 2025-05-07T20:33:02.3165507Z scale_ub_tensor = None 2025-05-07T20:33:02.3165774Z 2025-05-07T20:33:02.3166020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3166355Z op = silu_mul_quant 2025-05-07T20:33:02.3166625Z if compiled: 2025-05-07T20:33:02.3166894Z op = torch.compile(op) 2025-05-07T20:33:02.3167205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3167501Z 2025-05-07T20:33:02.3167708Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.3167881Z 2025-05-07T20:33:02.3167995Z moe/activation_test.py:117: 2025-05-07T20:33:02.3168301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3168653Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.3168957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3169674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:02.3170394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3170963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3171686Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3172383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3172942Z kernel = self.compile( 2025-05-07T20:33:02.3173513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3174194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3174620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3174869Z 2025-05-07T20:33:02.3175087Z self = 2025-05-07T20:33:02.3176213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3177700Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff78003a480>} 2025-05-07T20:33:02.3179092Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3180158Z context = 2025-05-07T20:33:02.3180466Z 2025-05-07T20:33:02.3180643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3181238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3181727Z module_map=module_map) 2025-05-07T20:33:02.3182116Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3182533Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3182806Z E ^ 2025-05-07T20:33:02.3183338Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3183818Z 2025-05-07T20:33:02.3184258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.4739689Z 2025-05-07T20:33:02.4740043Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.4740715Z self=, 2025-05-07T20:33:02.4741312Z T=16384, 2025-05-07T20:33:02.4741577Z D=7168, 2025-05-07T20:33:02.4741787Z scale_ub=None, 2025-05-07T20:33:02.4742018Z contiguous=True, 2025-05-07T20:33:02.4742256Z compiled=True, 2025-05-07T20:33:02.4742473Z ) 2025-05-07T20:33:02.4742809Z self = 2025-05-07T20:33:02.4743376Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:02.4743697Z 2025-05-07T20:33:02.4743797Z @given( 2025-05-07T20:33:02.4744040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4744376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4744706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4745052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4745405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4745710Z ) 2025-05-07T20:33:02.4746084Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4746552Z def test_silu_mul_quant( 2025-05-07T20:33:02.4746811Z self, 2025-05-07T20:33:02.4747019Z T: int, 2025-05-07T20:33:02.4747225Z D: int, 2025-05-07T20:33:02.4747459Z scale_ub: Optional[float], 2025-05-07T20:33:02.4747748Z contiguous: bool, 2025-05-07T20:33:02.4748044Z compiled: bool, 2025-05-07T20:33:02.4748289Z ) -> None: 2025-05-07T20:33:02.4748516Z torch.manual_seed(2025) 2025-05-07T20:33:02.4748777Z 2025-05-07T20:33:02.4749068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4749429Z 2025-05-07T20:33:02.4749639Z x_sign = torch.sign(x) 2025-05-07T20:33:02.4749953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.4750280Z x = x_sign * x_clamp 2025-05-07T20:33:02.4750541Z x0 = x[:, :D] 2025-05-07T20:33:02.4750783Z x1 = x[:, D:] 2025-05-07T20:33:02.4751000Z 2025-05-07T20:33:02.4751204Z if contiguous: 2025-05-07T20:33:02.4751455Z x0 = x0.contiguous() 2025-05-07T20:33:02.4751728Z x1 = x1.contiguous() 2025-05-07T20:33:02.4751993Z 2025-05-07T20:33:02.4752201Z if scale_ub is not None: 2025-05-07T20:33:02.4752488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.4752853Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.4753481Z ) 2025-05-07T20:33:02.4753699Z else: 2025-05-07T20:33:02.4753921Z scale_ub_tensor = None 2025-05-07T20:33:02.4754190Z 2025-05-07T20:33:02.4754442Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.4754773Z op = silu_mul_quant 2025-05-07T20:33:02.4755043Z if compiled: 2025-05-07T20:33:02.4755311Z op = torch.compile(op) 2025-05-07T20:33:02.4755624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4755920Z 2025-05-07T20:33:02.4756126Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.4756398Z 2025-05-07T20:33:02.4756506Z moe/activation_test.py:117: 2025-05-07T20:33:02.4756822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4757178Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.4757480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4758151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.4758822Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.4759522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.4760346Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.4760919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.4761645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.4762352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.4762910Z kernel = self.compile( 2025-05-07T20:33:02.4763482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.4764179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.4764607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4764857Z 2025-05-07T20:33:02.4765077Z self = 2025-05-07T20:33:02.4766209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.4767667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff78003b740>} 2025-05-07T20:33:02.4769074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.4770150Z context = 2025-05-07T20:33:02.4770464Z 2025-05-07T20:33:02.4770644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.4771199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.4771692Z module_map=module_map) 2025-05-07T20:33:02.4772074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.4772465Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.4780546Z E ^ 2025-05-07T20:33:02.4781057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.4781535Z 2025-05-07T20:33:02.4781989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.4782536Z 2025-05-07T20:33:02.4782644Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.4783178Z self=, 2025-05-07T20:33:02.4783608Z T=4096, 2025-05-07T20:33:02.4783803Z D=5120, 2025-05-07T20:33:02.4784006Z scale_ub=None, 2025-05-07T20:33:02.4784234Z contiguous=False, 2025-05-07T20:33:02.4784472Z compiled=True, 2025-05-07T20:33:02.4784682Z ) 2025-05-07T20:33:02.4785022Z self = 2025-05-07T20:33:02.4785544Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:02.4785826Z 2025-05-07T20:33:02.4785952Z @given( 2025-05-07T20:33:02.4786194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4786524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4786837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4787185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4787574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4787869Z ) 2025-05-07T20:33:02.4788279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4788745Z def test_silu_mul_quant( 2025-05-07T20:33:02.4789000Z self, 2025-05-07T20:33:02.4789198Z T: int, 2025-05-07T20:33:02.4789407Z D: int, 2025-05-07T20:33:02.4789636Z scale_ub: Optional[float], 2025-05-07T20:33:02.4789914Z contiguous: bool, 2025-05-07T20:33:02.4790167Z compiled: bool, 2025-05-07T20:33:02.4790402Z ) -> None: 2025-05-07T20:33:02.4790620Z torch.manual_seed(2025) 2025-05-07T20:33:02.4790876Z 2025-05-07T20:33:02.4791162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4791513Z 2025-05-07T20:33:02.4791714Z x_sign = torch.sign(x) 2025-05-07T20:33:02.4792019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.4792339Z x = x_sign * x_clamp 2025-05-07T20:33:02.4792588Z x0 = x[:, :D] 2025-05-07T20:33:02.4792818Z x1 = x[:, D:] 2025-05-07T20:33:02.4793031Z 2025-05-07T20:33:02.4793231Z if contiguous: 2025-05-07T20:33:02.4793472Z x0 = x0.contiguous() 2025-05-07T20:33:02.4793746Z x1 = x1.contiguous() 2025-05-07T20:33:02.4793993Z 2025-05-07T20:33:02.4794192Z if scale_ub is not None: 2025-05-07T20:33:02.4794476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.4794822Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.4795147Z ) 2025-05-07T20:33:02.4795357Z else: 2025-05-07T20:33:02.4795567Z scale_ub_tensor = None 2025-05-07T20:33:02.4795817Z 2025-05-07T20:33:02.4796063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.4796389Z op = silu_mul_quant 2025-05-07T20:33:02.4796653Z if compiled: 2025-05-07T20:33:02.4796914Z op = torch.compile(op) 2025-05-07T20:33:02.4797226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4797520Z 2025-05-07T20:33:02.4797728Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.4797899Z 2025-05-07T20:33:02.4798010Z moe/activation_test.py:117: 2025-05-07T20:33:02.4798312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4798659Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.4798945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4799522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.4800212Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.4800897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.4801609Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.4802175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.4802956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.4803651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.4804199Z kernel = self.compile( 2025-05-07T20:33:02.4804762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.4805446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.4805861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4806142Z 2025-05-07T20:33:02.4806356Z self = 2025-05-07T20:33:02.4807476Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.4808980Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780678c20>} 2025-05-07T20:33:02.4810366Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.4811414Z context = 2025-05-07T20:33:02.4811722Z 2025-05-07T20:33:02.4811894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.4812436Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.4812922Z module_map=module_map) 2025-05-07T20:33:02.4813585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.4814556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.4814859Z E ^ 2025-05-07T20:33:02.4815357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.4815837Z 2025-05-07T20:33:02.4816284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.6184531Z 2025-05-07T20:33:02.6184774Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.6185404Z self=, 2025-05-07T20:33:02.6185965Z T=4096, 2025-05-07T20:33:02.6186165Z D=5120, 2025-05-07T20:33:02.6186376Z scale_ub=1200.0, 2025-05-07T20:33:02.6186621Z contiguous=False, 2025-05-07T20:33:02.6186900Z compiled=False, 2025-05-07T20:33:02.6187120Z ) 2025-05-07T20:33:02.6187464Z self = 2025-05-07T20:33:02.6188013Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.6188303Z 2025-05-07T20:33:02.6188392Z @given( 2025-05-07T20:33:02.6188634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.6188976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.6189306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.6189655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.6190011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.6190320Z ) 2025-05-07T20:33:02.6190693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.6191166Z def test_silu_mul_quant( 2025-05-07T20:33:02.6191431Z self, 2025-05-07T20:33:02.6191635Z T: int, 2025-05-07T20:33:02.6191851Z D: int, 2025-05-07T20:33:02.6192087Z scale_ub: Optional[float], 2025-05-07T20:33:02.6192385Z contiguous: bool, 2025-05-07T20:33:02.6192792Z compiled: bool, 2025-05-07T20:33:02.6193043Z ) -> None: 2025-05-07T20:33:02.6193279Z torch.manual_seed(2025) 2025-05-07T20:33:02.6193531Z 2025-05-07T20:33:02.6193822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.6194187Z 2025-05-07T20:33:02.6194393Z x_sign = torch.sign(x) 2025-05-07T20:33:02.6194707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.6195041Z x = x_sign * x_clamp 2025-05-07T20:33:02.6195292Z x0 = x[:, :D] 2025-05-07T20:33:02.6195526Z x1 = x[:, D:] 2025-05-07T20:33:02.6195827Z 2025-05-07T20:33:02.6196022Z if contiguous: 2025-05-07T20:33:02.6196274Z x0 = x0.contiguous() 2025-05-07T20:33:02.6196553Z x1 = x1.contiguous() 2025-05-07T20:33:02.6196803Z 2025-05-07T20:33:02.6197010Z if scale_ub is not None: 2025-05-07T20:33:02.6197374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.6197738Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.6198127Z ) 2025-05-07T20:33:02.6198338Z else: 2025-05-07T20:33:02.6198567Z scale_ub_tensor = None 2025-05-07T20:33:02.6198827Z 2025-05-07T20:33:02.6199077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.6199415Z op = silu_mul_quant 2025-05-07T20:33:02.6199678Z if compiled: 2025-05-07T20:33:02.6199942Z op = torch.compile(op) 2025-05-07T20:33:02.6200364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6200655Z 2025-05-07T20:33:02.6200864Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.6201036Z 2025-05-07T20:33:02.6201150Z moe/activation_test.py:117: 2025-05-07T20:33:02.6201461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6201820Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.6202127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6202865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:02.6203589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.6204160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.6204884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.6205575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.6206143Z kernel = self.compile( 2025-05-07T20:33:02.6206716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.6207407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.6207825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6208076Z 2025-05-07T20:33:02.6208297Z self = 2025-05-07T20:33:02.6209433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.6210890Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7806796c0>} 2025-05-07T20:33:02.6212299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.6213535Z context = 2025-05-07T20:33:02.6213855Z 2025-05-07T20:33:02.6214107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.6214667Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.6215165Z module_map=module_map) 2025-05-07T20:33:02.6215548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.6215924Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.6216198Z E ^ 2025-05-07T20:33:02.6216682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.6217663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
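Every Hypothesis example below dies in the same place: Triton's frontend rejects the fp8e4nv (float8_e4m3fn) element type while lowering _fbgemm_silu_mul_quant. fp8e4nv math generally requires an NVIDIA GPU of compute capability 8.9 or newer (Ada/Hopper); on older parts such as an sm_86-class card, Triton only offers fp8e4b15 and fp8e5, which matches the error text. Because the kernel is JIT-compiled at first launch, the eager and torch.compile paths hit the identical CompilationError. A minimal probe for this, sketched against public torch APIs (the (8, 9) threshold is background knowledge, not something this log states):

    # Sketch: check whether the current CUDA device can compile fp8e4nv
    # Triton kernels. Assumption: fp8e4nv needs compute capability >= (8, 9);
    # older architectures raise the ValueError shown above.
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # (major, minor), e.g. (8, 6) for an A10G, (9, 0) for an H100.
        return torch.cuda.get_device_capability() >= (8, 9)

    if __name__ == "__main__":
        print("fp8e4nv supported:", supports_fp8e4nv())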
2025-05-07T20:33:02.6218318Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.6252087Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.8250668Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.8291232Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.9645696Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.9682938Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.9717010Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:03.3834821Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:03.3867400Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:03.5440399Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
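Since the failure is an unconditional property of the GPU architecture, a guard that skips the test on unsupported devices would avoid burning every generated example on the same error. A hypothetical sketch (the helper name, threshold, and decorator placement are illustrative assumptions, not the repository's actual fix):

    # Hypothetical skip guard for fp8e4nv-dependent tests.
    # _has_fp8e4nv and its (8, 9) threshold are assumptions for illustration.
    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipUnless(_has_fp8e4nv(), "Triton fp8e4nv requires sm_89 or newer")
        def test_silu_mul_quant_guarded(self) -> None:
            # The Hypothesis-driven body from the log would run here unchanged.
            self.assertTrue(torch.cuda.is_available())

    if __name__ == "__main__":
        unittest.main()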
Hypothesis continues drawing examples; each of the following fails with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") and an identical traceback from triton/compiler/compiler.py:100, differing only in the drawn parameters:

2025-05-07T20:33:03.5475065Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:03.7119202Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:03.7161044Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:03.8885574Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:03.8920589Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:04.0116349Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:04.0157875Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:04.0194404Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:04.1838600Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:04.1872302Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
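For context on what the test exercises: silu_mul_quant fuses SiLU(x0) * x1 with quantization of the result to fp8, returning the quantized tensor and its scale. A rough eager-mode sketch of that computation, assuming rowwise e4m3 max-abs scaling with the optional scale_ub acting as a cap on the row maximum (the exact scaling scheme inside FBGEMM's kernel may differ; silu_mul_quant_ref and FP8_MAX are illustrative names, not FBGEMM API):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 before quantizing.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise max-abs scaling; scale_ub (if given) caps the row max.
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)

This sketch also suggests why the test passes scale_ub as a float32 tensor on the device rather than a Python float: the bound is presumably read by the kernel at runtime instead of being baked into the compiled code.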
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.3204375Z 2025-05-07T20:33:04.3204810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.3205348Z 2025-05-07T20:33:04.3205455Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.3205888Z self=, 2025-05-07T20:33:04.3206311Z T=16384, 2025-05-07T20:33:04.3206508Z D=5120, 2025-05-07T20:33:04.3206728Z scale_ub=None, 2025-05-07T20:33:04.3206956Z contiguous=False, 2025-05-07T20:33:04.3207192Z compiled=False, 2025-05-07T20:33:04.3207409Z ) 2025-05-07T20:33:04.3207745Z self = 2025-05-07T20:33:04.3208268Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.3208575Z 2025-05-07T20:33:04.3208657Z @given( 2025-05-07T20:33:04.3208908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.3217715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.3218086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.3218439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.3218777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.3219077Z ) 2025-05-07T20:33:04.3219449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.3219914Z def test_silu_mul_quant( 2025-05-07T20:33:04.3220171Z self, 2025-05-07T20:33:04.3220376Z T: int, 2025-05-07T20:33:04.3220576Z D: int, 2025-05-07T20:33:04.3220809Z scale_ub: Optional[float], 2025-05-07T20:33:04.3221220Z contiguous: bool, 2025-05-07T20:33:04.3221466Z compiled: bool, 2025-05-07T20:33:04.3221705Z ) -> None: 2025-05-07T20:33:04.3221938Z torch.manual_seed(2025) 2025-05-07T20:33:04.3222254Z 2025-05-07T20:33:04.3222551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.3222917Z 2025-05-07T20:33:04.3223133Z x_sign = torch.sign(x) 2025-05-07T20:33:04.3223439Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.3225552Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.61 GiB allocated by PyTorch.

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free; 21.50 GiB allocated by PyTorch.

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB allocated by PyTorch.

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB allocated by PyTorch.

moe/activation_test.py:94: OutOfMemoryError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
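Every one of these compile failures is raised at the same spot, so pinning a single case makes for a quick local repro without rerunning the whole hypothesis search. A sketch using hypothesis's @example decorator (test body elided; only T and D shown for brevity):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=7168)  # the failing case above; explicit examples run first
    @settings(deadline=None)
    def test_repro(T: int, D: int) -> None:
        ...  # build x0/x1 and call silu_mul_quant as in test_silu_mul_quant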
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free; 21.69 GiB allocated by PyTorch.

moe/activation_test.py:92: OutOfMemoryError
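All of the OOM failures carry the allocator's own hint about fragmentation, and the examples run back to back in a single process. A sketch of the two mitigations that follow from that (the environment variable comes from the error text; the cleanup helper is illustrative, not part of the test suite):

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc
    import torch

    def reset_cuda_pool() -> None:
        # Drop references left over from a previous example, then return cached
        # segments to the driver so the next example starts with more headroom.
        gc.collect()
        torch.cuda.empty_cache()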
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
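For reference, the contract the test exercises is visible from the listing above: silu_mul_quant(x0, x1, scale_ub) returns a quantized tensor plus a scale. A rough eager-mode sketch of that contract, assuming SiLU(x0) * x1 followed by rowwise FP8 e4m3 quantization with an optional cap on the scale source; this is an inference from the test, not the actual fused Triton kernel:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # honor the upper bound
        scale = row_max.clamp(min=1e-12) / FP8_MAX      # one scale per row
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale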
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.5807809Z 2025-05-07T20:33:04.5808250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.6624698Z 2025-05-07T20:33:04.6625272Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6625923Z self=, 2025-05-07T20:33:04.6626453Z T=2048, 2025-05-07T20:33:04.6626658Z D=5120, 2025-05-07T20:33:04.6626863Z scale_ub=None, 2025-05-07T20:33:04.6627374Z contiguous=True, 2025-05-07T20:33:04.6627606Z compiled=False, 2025-05-07T20:33:04.6627826Z ) 2025-05-07T20:33:04.6628174Z self = 2025-05-07T20:33:04.6628787Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.6629083Z 2025-05-07T20:33:04.6629166Z @given( 2025-05-07T20:33:04.6629413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6629746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6630067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6630417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6630766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6631063Z ) 2025-05-07T20:33:04.6631434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6631900Z def test_silu_mul_quant( 2025-05-07T20:33:04.6632152Z self, 2025-05-07T20:33:04.6632359Z T: int, 2025-05-07T20:33:04.6632568Z D: int, 2025-05-07T20:33:04.6632884Z scale_ub: Optional[float], 2025-05-07T20:33:04.6633167Z contiguous: bool, 2025-05-07T20:33:04.6633425Z compiled: bool, 2025-05-07T20:33:04.6633752Z ) -> None: 2025-05-07T20:33:04.6633976Z torch.manual_seed(2025) 2025-05-07T20:33:04.6634232Z 2025-05-07T20:33:04.6634519Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6634874Z 2025-05-07T20:33:04.6635081Z > x_sign = torch.sign(x) 2025-05-07T20:33:04.6637111Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6639058Z 2025-05-07T20:33:04.6639192Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:04.6639415Z 2025-05-07T20:33:04.6639531Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6639966Z self=, 2025-05-07T20:33:04.6640511Z T=16384, 2025-05-07T20:33:04.6640719Z D=5120, 2025-05-07T20:33:04.6640928Z scale_ub=None, 2025-05-07T20:33:04.6641148Z contiguous=True, 2025-05-07T20:33:04.6641389Z compiled=False, 2025-05-07T20:33:04.6641605Z ) 2025-05-07T20:33:04.6641967Z self = 2025-05-07T20:33:04.6642492Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.6642786Z 2025-05-07T20:33:04.6642874Z @given( 2025-05-07T20:33:04.6643118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6643456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6643777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6644133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6644486Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6644788Z ) 2025-05-07T20:33:04.6645165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6645629Z def test_silu_mul_quant( 2025-05-07T20:33:04.6645886Z self, 2025-05-07T20:33:04.6646086Z T: int, 2025-05-07T20:33:04.6646296Z D: int, 2025-05-07T20:33:04.6646525Z scale_ub: Optional[float], 2025-05-07T20:33:04.6646807Z contiguous: bool, 2025-05-07T20:33:04.6647069Z compiled: bool, 2025-05-07T20:33:04.6647303Z ) -> None: 2025-05-07T20:33:04.6647527Z torch.manual_seed(2025) 2025-05-07T20:33:04.6647869Z 2025-05-07T20:33:04.6648159Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6650334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6652259Z 2025-05-07T20:33:04.6652389Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6652616Z 2025-05-07T20:33:04.6652724Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6653160Z self=, 2025-05-07T20:33:04.6653589Z T=4096, 2025-05-07T20:33:04.6653784Z D=5120, 2025-05-07T20:33:04.6654034Z scale_ub=None, 2025-05-07T20:33:04.6654263Z contiguous=True, 2025-05-07T20:33:04.6654502Z compiled=False, 2025-05-07T20:33:04.6654758Z ) 2025-05-07T20:33:04.6655091Z self = 2025-05-07T20:33:04.6655610Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.6655903Z 2025-05-07T20:33:04.6655989Z @given( 2025-05-07T20:33:04.6656232Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6656569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6656887Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6657242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6657590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6657886Z ) 2025-05-07T20:33:04.6658256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6658734Z def test_silu_mul_quant( 2025-05-07T20:33:04.6658991Z self, 2025-05-07T20:33:04.6659202Z T: int, 2025-05-07T20:33:04.6659413Z D: int, 2025-05-07T20:33:04.6659648Z scale_ub: Optional[float], 2025-05-07T20:33:04.6659941Z contiguous: bool, 2025-05-07T20:33:04.6660196Z compiled: bool, 2025-05-07T20:33:04.6660428Z ) -> None: 2025-05-07T20:33:04.6660660Z torch.manual_seed(2025) 2025-05-07T20:33:04.6660919Z 2025-05-07T20:33:04.6661213Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6663330Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6665263Z 2025-05-07T20:33:04.6665389Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6665633Z 2025-05-07T20:33:04.6665742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6666190Z self=, 2025-05-07T20:33:04.6666612Z T=2048, 2025-05-07T20:33:04.6666817Z D=5120, 2025-05-07T20:33:04.6667027Z scale_ub=None, 2025-05-07T20:33:04.6667259Z contiguous=False, 2025-05-07T20:33:04.6667495Z compiled=False, 2025-05-07T20:33:04.6667720Z ) 2025-05-07T20:33:04.6668055Z self = 2025-05-07T20:33:04.6668571Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.6668913Z 2025-05-07T20:33:04.6668995Z @given( 2025-05-07T20:33:04.6669239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6669601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6669927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6670273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6670612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6670914Z ) 2025-05-07T20:33:04.6671283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6671747Z def test_silu_mul_quant( 2025-05-07T20:33:04.6671997Z self, 2025-05-07T20:33:04.6672203Z T: int, 2025-05-07T20:33:04.6672411Z D: int, 2025-05-07T20:33:04.6672636Z scale_ub: Optional[float], 2025-05-07T20:33:04.6672921Z contiguous: bool, 2025-05-07T20:33:04.6673174Z compiled: bool, 2025-05-07T20:33:04.6673407Z ) -> None: 2025-05-07T20:33:04.6673636Z torch.manual_seed(2025) 2025-05-07T20:33:04.6673919Z 2025-05-07T20:33:04.6674272Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6676467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6678392Z 2025-05-07T20:33:04.6678516Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6678743Z 2025-05-07T20:33:04.6678853Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6679291Z self=, 2025-05-07T20:33:04.6679710Z T=4096, 2025-05-07T20:33:04.6679912Z D=7168, 2025-05-07T20:33:04.6680188Z scale_ub=None, 2025-05-07T20:33:04.6680411Z contiguous=True, 2025-05-07T20:33:04.6680648Z compiled=True, 2025-05-07T20:33:04.6680864Z ) 2025-05-07T20:33:04.6681192Z self = 2025-05-07T20:33:04.6681713Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:04.6681998Z 2025-05-07T20:33:04.6682078Z @given( 2025-05-07T20:33:04.6682320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6682649Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6682973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6683322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6683663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6683969Z ) 2025-05-07T20:33:04.6684339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6684802Z def test_silu_mul_quant( 2025-05-07T20:33:04.6685061Z self, 2025-05-07T20:33:04.6685271Z T: int, 2025-05-07T20:33:04.6685473Z D: int, 2025-05-07T20:33:04.6685707Z scale_ub: Optional[float], 2025-05-07T20:33:04.6685992Z contiguous: bool, 2025-05-07T20:33:04.6686239Z compiled: bool, 2025-05-07T20:33:04.6686472Z ) -> None: 2025-05-07T20:33:04.6686699Z torch.manual_seed(2025) 2025-05-07T20:33:04.6686959Z 2025-05-07T20:33:04.6687239Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6689421Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6691394Z 2025-05-07T20:33:04.6691519Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6691743Z 2025-05-07T20:33:04.6691857Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6692288Z self=, 2025-05-07T20:33:04.6692710Z T=2048, 2025-05-07T20:33:04.6692908Z D=5120, 2025-05-07T20:33:04.6693113Z scale_ub=1200.0, 2025-05-07T20:33:04.6693345Z contiguous=False, 2025-05-07T20:33:04.6693585Z compiled=False, 2025-05-07T20:33:04.7245335Z ) 2025-05-07T20:33:04.7245871Z self = 2025-05-07T20:33:04.7246601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.7247217Z 2025-05-07T20:33:04.7247347Z @given( 2025-05-07T20:33:04.7247679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7248195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7248526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7248872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7249227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7249533Z ) 2025-05-07T20:33:04.7249896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7250363Z def test_silu_mul_quant( 2025-05-07T20:33:04.7250622Z self, 2025-05-07T20:33:04.7250829Z T: int, 2025-05-07T20:33:04.7251034Z D: int, 2025-05-07T20:33:04.7251267Z scale_ub: Optional[float], 2025-05-07T20:33:04.7251555Z contiguous: bool, 2025-05-07T20:33:04.7251811Z compiled: bool, 2025-05-07T20:33:04.7252052Z ) -> None: 2025-05-07T20:33:04.7252287Z torch.manual_seed(2025) 2025-05-07T20:33:04.7252577Z 2025-05-07T20:33:04.7252871Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7254997Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7256933Z 2025-05-07T20:33:04.7257059Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7257291Z 2025-05-07T20:33:04.7257400Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7257843Z self=, 2025-05-07T20:33:04.7258266Z T=4096, 2025-05-07T20:33:04.7258466Z D=7168, 2025-05-07T20:33:04.7258675Z scale_ub=1200.0, 2025-05-07T20:33:04.7258910Z contiguous=True, 2025-05-07T20:33:04.7259141Z compiled=False, 2025-05-07T20:33:04.7259358Z ) 2025-05-07T20:33:04.7259699Z self = 2025-05-07T20:33:04.7260219Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.7260515Z 2025-05-07T20:33:04.7260599Z @given( 2025-05-07T20:33:04.7260841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7261173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7261502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7261852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7262282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7262590Z ) 2025-05-07T20:33:04.7263028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7263499Z def test_silu_mul_quant( 2025-05-07T20:33:04.7263756Z self, 2025-05-07T20:33:04.7263964Z T: int, 2025-05-07T20:33:04.7264175Z D: int, 2025-05-07T20:33:04.7264414Z scale_ub: Optional[float], 2025-05-07T20:33:04.7264703Z contiguous: bool, 2025-05-07T20:33:04.7264953Z compiled: bool, 2025-05-07T20:33:04.7265191Z ) -> None: 2025-05-07T20:33:04.7265423Z torch.manual_seed(2025) 2025-05-07T20:33:04.7265675Z 2025-05-07T20:33:04.7265970Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7268146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7270108Z 2025-05-07T20:33:04.7270240Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7270465Z 2025-05-07T20:33:04.7270575Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7271015Z self=, 2025-05-07T20:33:04.7271439Z T=16384, 2025-05-07T20:33:04.7271645Z D=7168, 2025-05-07T20:33:04.7271916Z scale_ub=None, 2025-05-07T20:33:04.7272266Z contiguous=False, 2025-05-07T20:33:04.7272714Z compiled=True, 2025-05-07T20:33:04.7273044Z ) 2025-05-07T20:33:04.7273465Z self = 2025-05-07T20:33:04.7282332Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.7282644Z 2025-05-07T20:33:04.7282736Z @given( 2025-05-07T20:33:04.7282979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7283313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7283640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7283984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7284332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7284634Z ) 2025-05-07T20:33:04.7285005Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7285466Z def test_silu_mul_quant( 2025-05-07T20:33:04.7285722Z self, 2025-05-07T20:33:04.7285926Z T: int, 2025-05-07T20:33:04.7286126Z D: int, 2025-05-07T20:33:04.7286354Z scale_ub: Optional[float], 2025-05-07T20:33:04.7286641Z contiguous: bool, 2025-05-07T20:33:04.7286886Z compiled: bool, 2025-05-07T20:33:04.7287124Z ) -> None: 2025-05-07T20:33:04.7287352Z torch.manual_seed(2025) 2025-05-07T20:33:04.7287598Z 2025-05-07T20:33:04.7287894Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7290043Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7291986Z 2025-05-07T20:33:04.7292190Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7292411Z 2025-05-07T20:33:04.7292526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7293001Z self=, 2025-05-07T20:33:04.7293429Z T=4096, 2025-05-07T20:33:04.7293627Z D=7168, 2025-05-07T20:33:04.7293823Z scale_ub=None, 2025-05-07T20:33:04.7294053Z contiguous=True, 2025-05-07T20:33:04.7294289Z compiled=False, 2025-05-07T20:33:04.7294498Z ) 2025-05-07T20:33:04.7294839Z self = 2025-05-07T20:33:04.7295363Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.7295646Z 2025-05-07T20:33:04.7295735Z @given( 2025-05-07T20:33:04.7295972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7296301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7296625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7296970Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7297365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7297668Z ) 2025-05-07T20:33:04.7298073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7298538Z def test_silu_mul_quant( 2025-05-07T20:33:04.7298794Z self, 2025-05-07T20:33:04.7298992Z T: int, 2025-05-07T20:33:04.7299203Z D: int, 2025-05-07T20:33:04.7299434Z scale_ub: Optional[float], 2025-05-07T20:33:04.7299712Z contiguous: bool, 2025-05-07T20:33:04.7299967Z compiled: bool, 2025-05-07T20:33:04.7300203Z ) -> None: 2025-05-07T20:33:04.7300433Z torch.manual_seed(2025) 2025-05-07T20:33:04.7300681Z 2025-05-07T20:33:04.7300967Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7303088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7305017Z 2025-05-07T20:33:04.7305146Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7305366Z 2025-05-07T20:33:04.7305474Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7305908Z self=, 2025-05-07T20:33:04.7306327Z T=16384, 2025-05-07T20:33:04.7306530Z D=7168, 2025-05-07T20:33:04.7306724Z scale_ub=None, 2025-05-07T20:33:04.7306949Z contiguous=True, 2025-05-07T20:33:04.7307184Z compiled=False, 2025-05-07T20:33:04.7307392Z ) 2025-05-07T20:33:04.7307723Z self = 2025-05-07T20:33:04.7308252Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.7308544Z 2025-05-07T20:33:04.7308625Z @given( 2025-05-07T20:33:04.7308866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7309191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7309504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7309850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7310195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7310495Z ) 2025-05-07T20:33:04.7310852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7311311Z def test_silu_mul_quant( 2025-05-07T20:33:04.7311566Z self, 2025-05-07T20:33:04.7311763Z T: int, 2025-05-07T20:33:04.7312024Z D: int, 2025-05-07T20:33:04.7312259Z scale_ub: Optional[float], 2025-05-07T20:33:04.7312543Z contiguous: bool, 2025-05-07T20:33:04.7312796Z compiled: bool, 2025-05-07T20:33:04.7313069Z ) -> None: 2025-05-07T20:33:04.7313293Z torch.manual_seed(2025) 2025-05-07T20:33:04.7313963Z 2025-05-07T20:33:04.7314300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7316453Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7318498Z 2025-05-07T20:33:04.7318630Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7318976Z 2025-05-07T20:33:04.7319088Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7319585Z self=, 2025-05-07T20:33:04.7320009Z T=16384, 2025-05-07T20:33:04.7320299Z D=7168, 2025-05-07T20:33:04.7320504Z scale_ub=1200.0, 2025-05-07T20:33:04.7320739Z contiguous=True, 2025-05-07T20:33:04.7320964Z compiled=False, 2025-05-07T20:33:04.7321181Z ) 2025-05-07T20:33:04.7321517Z self = 2025-05-07T20:33:04.7322031Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.7322333Z 2025-05-07T20:33:04.7322415Z @given( 2025-05-07T20:33:04.7322656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7322983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7323302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7323652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7324004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7324299Z ) 2025-05-07T20:33:04.7324664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7325124Z def test_silu_mul_quant( 2025-05-07T20:33:04.7325371Z self, 2025-05-07T20:33:04.7325576Z T: int, 2025-05-07T20:33:04.7325782Z D: int, 2025-05-07T20:33:04.7326002Z scale_ub: Optional[float], 2025-05-07T20:33:04.7326290Z contiguous: bool, 2025-05-07T20:33:04.7326544Z compiled: bool, 2025-05-07T20:33:04.7326779Z ) -> None: 2025-05-07T20:33:04.7326995Z torch.manual_seed(2025) 2025-05-07T20:33:04.7327247Z 2025-05-07T20:33:04.7327533Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7329649Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
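[annotation] A 22.07 GiB card failing a 56 MiB request means tensors from earlier examples (plus the allocator's cache) are still holding the GPU. One hedged mitigation, assuming the suite can afford a cleanup hook per example, is to drop dead references and return cached blocks between runs; release_cuda_memory below is an illustrative helper, not an FBGEMM API:

    # Sketch: release cached CUDA blocks between Hypothesis examples.
    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()                  # drop dead Python references to tensors
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

Calling this at the top of the test body (or from setUp) keeps one example's residue from starving the next; it cannot fix a genuine leak, only caching pressure.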
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7331573Z 2025-05-07T20:33:04.7331694Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.9144238Z 2025-05-07T20:33:04.9145058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9145754Z self=, 2025-05-07T20:33:04.9146368Z T=128, 2025-05-07T20:33:04.9146643Z D=5120, 2025-05-07T20:33:04.9146929Z scale_ub=1200.0, 2025-05-07T20:33:04.9147514Z contiguous=False, 2025-05-07T20:33:04.9147755Z compiled=False, 2025-05-07T20:33:04.9147983Z ) 2025-05-07T20:33:04.9148403Z self = 2025-05-07T20:33:04.9148934Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.9149230Z 2025-05-07T20:33:04.9149312Z @given( 2025-05-07T20:33:04.9149557Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9149883Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9150207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9150557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9150899Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9151202Z ) 2025-05-07T20:33:04.9151572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9152030Z def test_silu_mul_quant( 2025-05-07T20:33:04.9152288Z self, 2025-05-07T20:33:04.9152500Z T: int, 2025-05-07T20:33:04.9152700Z D: int, 2025-05-07T20:33:04.9153058Z scale_ub: Optional[float], 2025-05-07T20:33:04.9153358Z contiguous: bool, 2025-05-07T20:33:04.9153699Z compiled: bool, 2025-05-07T20:33:04.9153935Z ) -> None: 2025-05-07T20:33:04.9154194Z torch.manual_seed(2025) 2025-05-07T20:33:04.9154440Z 2025-05-07T20:33:04.9154725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9155082Z 2025-05-07T20:33:04.9155280Z x_sign = torch.sign(x) 2025-05-07T20:33:04.9155588Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.9155915Z x = x_sign * x_clamp 2025-05-07T20:33:04.9156160Z x0 = x[:, :D] 2025-05-07T20:33:04.9156392Z x1 = x[:, D:] 2025-05-07T20:33:04.9156617Z 2025-05-07T20:33:04.9156805Z if contiguous: 2025-05-07T20:33:04.9157047Z x0 = x0.contiguous() 2025-05-07T20:33:04.9157321Z x1 = x1.contiguous() 2025-05-07T20:33:04.9157564Z 2025-05-07T20:33:04.9157767Z if scale_ub is not None: 2025-05-07T20:33:04.9158054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.9158405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.9158723Z ) 2025-05-07T20:33:04.9158925Z else: 2025-05-07T20:33:04.9159142Z scale_ub_tensor = None 2025-05-07T20:33:04.9159397Z 2025-05-07T20:33:04.9159637Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.9159966Z op = silu_mul_quant 2025-05-07T20:33:04.9160357Z if compiled: 2025-05-07T20:33:04.9160615Z op = torch.compile(op) 2025-05-07T20:33:04.9160924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9161205Z 2025-05-07T20:33:04.9161410Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.9161579Z 2025-05-07T20:33:04.9161687Z moe/activation_test.py:117: 2025-05-07T20:33:04.9162008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9162361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.9162652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9163380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.9164150Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.9164714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.9165432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.9166124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.9166685Z kernel = self.compile( 2025-05-07T20:33:04.9167242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.9167979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.9168434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9168675Z 2025-05-07T20:33:04.9168894Z self = 2025-05-07T20:33:04.9170018Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.9171484Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf7e11c0>} 2025-05-07T20:33:04.9172878Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.9173949Z context = 2025-05-07T20:33:04.9174297Z 2025-05-07T20:33:04.9174510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.9175070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.9175559Z module_map=module_map) 2025-05-07T20:33:04.9175936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.9176298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.9176568Z E ^ 2025-05-07T20:33:04.9177049Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.9177523Z 2025-05-07T20:33:04.9177956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.9178500Z 2025-05-07T20:33:04.9178608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9179048Z self=, 2025-05-07T20:33:04.9179472Z T=2048, 2025-05-07T20:33:04.9179666Z D=7168, 2025-05-07T20:33:04.9179868Z scale_ub=None, 2025-05-07T20:33:04.9180093Z contiguous=False, 2025-05-07T20:33:04.9180324Z compiled=False, 2025-05-07T20:33:04.9180538Z ) 2025-05-07T20:33:04.9180873Z self = 2025-05-07T20:33:04.9181390Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.9181682Z 2025-05-07T20:33:04.9181762Z @given( 2025-05-07T20:33:04.9182004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9182333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9182647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9182998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9183345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9183642Z ) 2025-05-07T20:33:04.9184014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9184475Z def test_silu_mul_quant( 2025-05-07T20:33:04.9184723Z self, 2025-05-07T20:33:04.9184926Z T: int, 2025-05-07T20:33:04.9185131Z D: int, 2025-05-07T20:33:04.9185357Z scale_ub: Optional[float], 2025-05-07T20:33:04.9185645Z contiguous: bool, 2025-05-07T20:33:04.9185897Z compiled: bool, 2025-05-07T20:33:04.9186124Z ) -> None: 2025-05-07T20:33:04.9186353Z torch.manual_seed(2025) 2025-05-07T20:33:04.9186608Z 2025-05-07T20:33:04.9186891Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9189075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
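[annotation] This CompilationError is an architecture mismatch rather than a kernel bug: Triton only lowers the fp8e4nv (e4m3) dtype on newer NVIDIA parts, and the A10G in a g5 instance reports compute capability (8, 6). A common pattern is to skip fp8 tests below a threshold; the sketch below assumes sm_89 (Ada/Hopper) as the cutoff, which should be verified against the Triton version in use:

    # Sketch: gate fp8 (e4m3) tests on GPU compute capability.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv only on compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With a gate like this, sm_86 runners would report skips instead of Hypothesis re-deriving the same CompilationError example after example.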
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.9191042Z 2025-05-07T20:33:04.9191165Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.9191392Z 2025-05-07T20:33:04.9191498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9191935Z self=, 2025-05-07T20:33:04.9192348Z T=128, 2025-05-07T20:33:04.9192544Z D=7168, 2025-05-07T20:33:04.9192745Z scale_ub=1200.0, 2025-05-07T20:33:04.9192978Z contiguous=True, 2025-05-07T20:33:04.9193209Z compiled=True, 2025-05-07T20:33:04.9193425Z ) 2025-05-07T20:33:04.9193756Z self = 2025-05-07T20:33:04.9194362Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:04.9194649Z 2025-05-07T20:33:04.9194767Z @given( 2025-05-07T20:33:04.9195009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9195329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9195649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9195991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9196329Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9196627Z ) 2025-05-07T20:33:04.9196991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9197451Z def test_silu_mul_quant( 2025-05-07T20:33:04.9197694Z self, 2025-05-07T20:33:04.9197896Z T: int, 2025-05-07T20:33:04.9198101Z D: int, 2025-05-07T20:33:04.9198325Z scale_ub: Optional[float], 2025-05-07T20:33:04.9198608Z contiguous: bool, 2025-05-07T20:33:04.9198861Z compiled: bool, 2025-05-07T20:33:04.9199087Z ) -> None: 2025-05-07T20:33:04.9199318Z torch.manual_seed(2025) 2025-05-07T20:33:04.9199575Z 2025-05-07T20:33:04.9199850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9200292Z 2025-05-07T20:33:04.9200496Z x_sign = torch.sign(x) 2025-05-07T20:33:04.9200794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.9201116Z x = x_sign * x_clamp 2025-05-07T20:33:04.9201367Z x0 = x[:, :D] 2025-05-07T20:33:04.9201588Z x1 = x[:, D:] 2025-05-07T20:33:04.9201808Z 2025-05-07T20:33:04.9202000Z if contiguous: 2025-05-07T20:33:04.9202238Z x0 = x0.contiguous() 2025-05-07T20:33:04.9202508Z x1 = x1.contiguous() 2025-05-07T20:33:04.9202759Z 2025-05-07T20:33:04.9202959Z if scale_ub is not None: 2025-05-07T20:33:04.9203244Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.9203599Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.9203924Z ) 2025-05-07T20:33:04.9204122Z else: 2025-05-07T20:33:04.9204344Z scale_ub_tensor = None 2025-05-07T20:33:04.9204608Z 2025-05-07T20:33:04.9204844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.9205176Z op = silu_mul_quant 2025-05-07T20:33:04.9205436Z if compiled: 2025-05-07T20:33:04.9205688Z op = torch.compile(op) 2025-05-07T20:33:04.9205999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9206290Z 2025-05-07T20:33:04.9206486Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.9206663Z 2025-05-07T20:33:04.9206765Z moe/activation_test.py:117: 2025-05-07T20:33:04.9207070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9207479Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.9207768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9208391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.9208977Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.9209657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.9210368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.9210931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.9211643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.9212332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.9212887Z kernel = self.compile( 2025-05-07T20:33:04.9213758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.9214537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.9215011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9215260Z 2025-05-07T20:33:04.9215476Z self = 2025-05-07T20:33:04.9216601Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.9218034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf9f7b00>} 2025-05-07T20:33:04.9219431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.9220504Z context = 2025-05-07T20:33:04.9220817Z 2025-05-07T20:33:04.9220994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.9221543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.9222031Z module_map=module_map) 2025-05-07T20:33:04.9222415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.9222793Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.9223063Z E ^ 2025-05-07T20:33:04.9223550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.9224053Z 2025-05-07T20:33:04.9224512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.1986812Z 2025-05-07T20:33:05.1987413Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.1988076Z self=, 2025-05-07T20:33:05.1988656Z T=128, 2025-05-07T20:33:05.1988913Z D=7168, 2025-05-07T20:33:05.1989169Z scale_ub=1200.0, 2025-05-07T20:33:05.1989463Z contiguous=True, 2025-05-07T20:33:05.1989761Z compiled=False, 2025-05-07T20:33:05.1989999Z ) 2025-05-07T20:33:05.1990348Z self = 2025-05-07T20:33:05.1990858Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.1991144Z 2025-05-07T20:33:05.1991225Z @given( 2025-05-07T20:33:05.1991465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.1991781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.1992102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.1992714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.1993049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.1993434Z ) 2025-05-07T20:33:05.1993802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.1994260Z def test_silu_mul_quant( 2025-05-07T20:33:05.1994503Z self, 2025-05-07T20:33:05.1994702Z T: int, 2025-05-07T20:33:05.1994902Z D: int, 2025-05-07T20:33:05.1995120Z scale_ub: Optional[float], 2025-05-07T20:33:05.1995401Z contiguous: bool, 2025-05-07T20:33:05.1995651Z compiled: bool, 2025-05-07T20:33:05.1995879Z ) -> None: 2025-05-07T20:33:05.1996104Z torch.manual_seed(2025) 2025-05-07T20:33:05.1996352Z 2025-05-07T20:33:05.1996627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.1996979Z 2025-05-07T20:33:05.1997179Z x_sign = torch.sign(x) 2025-05-07T20:33:05.1997481Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.1999768Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
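[annotation] Note where the next failures trip: not the initial randn but the torch.clamp two lines later. torch.sign, torch.abs, and torch.clamp each materialize another [T, 2*D] temporary on top of x, so peak usage is a small multiple of the input. A hedged rewrite with in-place ops, equivalent here because the freshly created x has no other aliases:

    # Sketch: the same sign-preserving clamp with fewer full-size temporaries.
    import torch

    T, D = 128, 7168  # one of the failing parametrizations
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x_sign = torch.sign(x)       # the one remaining extra buffer
    x.abs_().clamp_(0.01, 2.0)   # in place: no abs/clamp temporaries
    x.mul_(x_sign)

This trims the preprocessing peak; it does nothing about memory already held when the example starts.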
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2001818Z 2025-05-07T20:33:05.2001943Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.2002169Z 2025-05-07T20:33:05.2002275Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2002701Z self=, 2025-05-07T20:33:05.2003114Z T=128, 2025-05-07T20:33:05.2003307Z D=5120, 2025-05-07T20:33:05.2003521Z scale_ub=1200.0, 2025-05-07T20:33:05.2003754Z contiguous=True, 2025-05-07T20:33:05.2003983Z compiled=True, 2025-05-07T20:33:05.2004198Z ) 2025-05-07T20:33:05.2004532Z self = 2025-05-07T20:33:05.2005039Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.2005314Z 2025-05-07T20:33:05.2005392Z @given( 2025-05-07T20:33:05.2005628Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2005950Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2006261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2006606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2006949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2007241Z ) 2025-05-07T20:33:05.2007607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2008067Z def test_silu_mul_quant( 2025-05-07T20:33:05.2008315Z self, 2025-05-07T20:33:05.2021798Z T: int, 2025-05-07T20:33:05.2022187Z D: int, 2025-05-07T20:33:05.2022619Z scale_ub: Optional[float], 2025-05-07T20:33:05.2023114Z contiguous: bool, 2025-05-07T20:33:05.2023473Z compiled: bool, 2025-05-07T20:33:05.2023822Z ) -> None: 2025-05-07T20:33:05.2024150Z torch.manual_seed(2025) 2025-05-07T20:33:05.2024518Z 2025-05-07T20:33:05.2024940Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2025461Z 2025-05-07T20:33:05.2025767Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2026218Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2029445Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
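[annotation] For orientation, the operation under test is "SiLU then multiply": silu(x0) * x1 with silu(z) = z * sigmoid(z), computed on the two halves of x. That is exactly what the test's ref_fn evaluates in fp32 before quantizing; a standalone sketch:

    # Sketch: the unquantized SiLU-mul that silu_mul_quant fuses with fp8 output.
    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32  # silu(x0) * x1

The fused kernel's value is doing this and the row-wise fp8 quantization in one pass instead of materializing y in fp32.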
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2032289Z 2025-05-07T20:33:05.2032473Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.2032790Z 2025-05-07T20:33:05.2032961Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2033571Z self=, 2025-05-07T20:33:05.2034199Z T=128, 2025-05-07T20:33:05.2034505Z D=7168, 2025-05-07T20:33:05.2034788Z scale_ub=None, 2025-05-07T20:33:05.2035113Z contiguous=True, 2025-05-07T20:33:05.2035448Z compiled=True, 2025-05-07T20:33:05.2035742Z ) 2025-05-07T20:33:05.2036218Z self = 2025-05-07T20:33:05.2036938Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2037426Z 2025-05-07T20:33:05.2037554Z @given( 2025-05-07T20:33:05.2037976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2038452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2038915Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2039398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2039893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2040426Z ) 2025-05-07T20:33:05.2040934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2041477Z def test_silu_mul_quant( 2025-05-07T20:33:05.2041803Z self, 2025-05-07T20:33:05.2042024Z T: int, 2025-05-07T20:33:05.2042263Z D: int, 2025-05-07T20:33:05.2042570Z scale_ub: Optional[float], 2025-05-07T20:33:05.2042889Z contiguous: bool, 2025-05-07T20:33:05.2043198Z compiled: bool, 2025-05-07T20:33:05.2043522Z ) -> None: 2025-05-07T20:33:05.2048285Z torch.manual_seed(2025) 2025-05-07T20:33:05.2048550Z 2025-05-07T20:33:05.2048849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2050983Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
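[annotation] The reference path's triton_quantize_fp8_row is per-row dynamic quantization: each row is scaled so its max magnitude maps to the fp8 format's largest value (optionally capped by scale_ub), and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that scheme; quantize_fp8_row_ref is illustrative, not FBGEMM's implementation:

    # Sketch: per-row e4m3 quantization consistent with multiply-to-dequantize.
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:                  # scale_ub: 1-element fp32 tensor
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = row_max / FP8_MAX
        scale = torch.where(scale == 0, torch.ones_like(scale), scale)  # guard 0 rows
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The e4m3 requirement is also why this path dies on sm_86: fp8e4nv in the error message is Triton's name for this format.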
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2052892Z 2025-05-07T20:33:05.2053019Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.2053247Z 2025-05-07T20:33:05.2053846Z FAILED 2025-05-07T20:33:05.2053969Z 2025-05-07T20:33:05.2054114Z =================================== FAILURES =================================== 2025-05-07T20:33:05.2054568Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:05.2055038Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:05.2055681Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:05.2056244Z | yield 2025-05-07T20:33:05.2056705Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:05.2057326Z | self._callTestMethod(testMethod) 2025-05-07T20:33:05.2057639Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:05.2058234Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:05.2058917Z | if method() is not None: 2025-05-07T20:33:05.2059189Z | ~~~~~~^^ 2025-05-07T20:33:05.2059906Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:05.2060671Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2060992Z | ^^^^^^^ 2025-05-07T20:33:05.2061595Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:05.2062386Z | raise the_error_hypothesis_found 2025-05-07T20:33:05.2062975Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:05.2063517Z +-+---------------- 1 ---------------- 2025-05-07T20:33:05.2063879Z | Traceback (most recent call last): 2025-05-07T20:33:05.2064970Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:05.2066203Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2069257Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
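[annotation] The "+ Exception Group Traceback" framing comes from Hypothesis reporting all distinct failures at once as a PEP 654 ExceptionGroup. On Python 3.11+ (this job runs 3.13) such a group can be unpacked with except*; a sketch, where run_suite is a hypothetical entry point that lets the group propagate:

    # Sketch: splitting Hypothesis's ExceptionGroup into OOM vs. other failures.
    import torch

    def run_suite() -> None:
        ...  # hypothetical: invokes the test and re-raises its ExceptionGroup

    try:
        run_suite()
    except* torch.OutOfMemoryError as oom_group:
        for exc in oom_group.exceptions:
            print("OOM sub-failure:", exc)
    except* Exception as other_group:
        for exc in other_group.exceptions:
            print("other sub-failure:", type(exc).__name__)

Applied to this run, the split would surface the two root causes: CUDA OOM on the larger shapes, and the fp8e4nv CompilationError independent of shape.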
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2072123Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2072759Z | self=, 2025-05-07T20:33:05.2073352Z | T=2048, 2025-05-07T20:33:05.2073689Z | D=5120, # or any other generated value 2025-05-07T20:33:05.2074169Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:05.2074700Z | contiguous=True, # or any other generated value 2025-05-07T20:33:05.2075236Z | compiled=False, # or any other generated value 2025-05-07T20:33:05.2075684Z | ) 2025-05-07T20:33:05.2075935Z | 2025-05-07T20:33:05.2076699Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:05.2077583Z +---------------- 2 ---------------- 2025-05-07T20:33:05.2078001Z | Traceback (most recent call last): 2025-05-07T20:33:05.2079031Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:05.2080279Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2083283Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2085984Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2086450Z | self=, 2025-05-07T20:33:05.2086888Z | T=128, 2025-05-07T20:33:05.2087107Z | D=7168, 2025-05-07T20:33:05.2087325Z | scale_ub=None, 2025-05-07T20:33:05.2087581Z | contiguous=True, 2025-05-07T20:33:05.2087906Z | compiled=True, 2025-05-07T20:33:05.2088143Z | ) 2025-05-07T20:33:05.2088328Z | 2025-05-07T20:33:05.2088925Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:05.2089562Z +---------------- 3 ---------------- 2025-05-07T20:33:05.2089873Z | Traceback (most recent call last): 2025-05-07T20:33:05.2090613Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:05.2091426Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2093588Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
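[annotation] Each sub-failure ends with a ready-made repro line: @reproduce_failure is a documented Hypothesis decorator that replays exactly one stored example. The blob encodes the choice sequence, the version pin ('6.131.14') guards against encoding changes, and the blob only decodes against the same strategies in the same order, so the decorator stacks on top of the test's existing @given. A sketch using the blob from sub-failure 2 above:

    # Sketch: temporarily pin the test to one falsifying example.
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=')  # copied verbatim from the log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body as in the suite; the decorator is meant to be temporary

Hypothesis raises DidNotReproduce if the pinned example stops failing, which makes the decorator safe to carry only on a debugging branch.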
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2095721Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2096185Z | self=, 2025-05-07T20:33:05.2096604Z | T=128, 2025-05-07T20:33:05.2096819Z | D=5120, 2025-05-07T20:33:05.2097047Z | scale_ub=1200.0, 2025-05-07T20:33:05.2097296Z | contiguous=True, 2025-05-07T20:33:05.2097551Z | compiled=True, 2025-05-07T20:33:05.2097788Z | ) 2025-05-07T20:33:05.2097970Z | 2025-05-07T20:33:05.2098522Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:05.2099148Z +---------------- 4 ---------------- 2025-05-07T20:33:05.2099459Z | Traceback (most recent call last): 2025-05-07T20:33:05.2100208Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:05.2100956Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.2101257Z | ~~~~~~^^ 2025-05-07T20:33:05.2101924Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:05.2102648Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2103518Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:05.2104347Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2104645Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:05.2104922Z | a, 2025-05-07T20:33:05.2105133Z | ^^ 2025-05-07T20:33:05.2105365Z | ...<23 lines>... 
2025-05-07T20:33:05.2105626Z | USE_INT64=use_int64, 2025-05-07T20:33:05.2105903Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2106162Z | ) 2025-05-07T20:33:05.2106362Z | ^ 2025-05-07T20:33:05.2106904Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:05.2107672Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2108146Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2108819Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:05.2109631Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2110181Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2110926Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:05.2111647Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2112053Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2112698Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:05.2113283Z | fn() 2025-05-07T20:33:05.2113898Z | ~~^^ 2025-05-07T20:33:05.2114566Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:05.2115479Z | self.fn.run( 2025-05-07T20:33:05.2115795Z | ~~~~~~~~~~~^ 2025-05-07T20:33:05.2116113Z | *args, 2025-05-07T20:33:05.2116418Z | ^^^^^^ 2025-05-07T20:33:05.2116930Z | **current, 2025-05-07T20:33:05.2117250Z | ^^^^^^^^^^ 2025-05-07T20:33:05.2117569Z | ) 2025-05-07T20:33:05.2117923Z | ^ 2025-05-07T20:33:05.2118654Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:05.2119504Z | kernel = self.compile( 2025-05-07T20:33:05.2119880Z | src, 2025-05-07T20:33:05.2120278Z | target=target, 2025-05-07T20:33:05.2120665Z | options=options.__dict__, 2025-05-07T20:33:05.2121065Z | ) 2025-05-07T20:33:05.2121850Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:05.2122886Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2123937Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:05.2125106Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2125791Z | module_map=module_map) 2025-05-07T20:33:05.2126318Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2126824Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2127193Z | ^ 2025-05-07T20:33:05.2165832Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2166708Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2167281Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:05.2167998Z | self=, 2025-05-07T20:33:05.2168610Z | T=1, # or any other generated value 2025-05-07T20:33:05.2169039Z | D=5120, # or any other generated value 2025-05-07T20:33:05.2169496Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:05.2169993Z | contiguous=True, # or any other generated value 2025-05-07T20:33:05.2170487Z | compiled=True, # or any other generated value 2025-05-07T20:33:05.2170895Z | ) 2025-05-07T20:33:05.2171146Z | 2025-05-07T20:33:05.2171897Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:05.2172766Z +------------------------------------ 2025-05-07T20:33:05.2173282Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:05.2173820Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2174409Z self=, 2025-05-07T20:33:05.2174964Z T=1, 2025-05-07T20:33:05.2175422Z D=5120, 2025-05-07T20:33:05.2175692Z scale_ub=None, 2025-05-07T20:33:05.2175982Z contiguous=True, 2025-05-07T20:33:05.2176287Z compiled=True, 2025-05-07T20:33:05.2176566Z ) 2025-05-07T20:33:05.2177085Z self = 2025-05-07T20:33:05.2177740Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2178099Z 2025-05-07T20:33:05.2178210Z @given( 2025-05-07T20:33:05.2178519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2178934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2179351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2179796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2180236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2180637Z ) 2025-05-07T20:33:05.2181124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2181743Z def test_silu_mul_quant( 2025-05-07T20:33:05.2182083Z self, 2025-05-07T20:33:05.2182422Z T: int, 2025-05-07T20:33:05.2182695Z D: int, 2025-05-07T20:33:05.2183005Z scale_ub: Optional[float], 2025-05-07T20:33:05.2183448Z contiguous: bool, 2025-05-07T20:33:05.2183795Z compiled: bool, 2025-05-07T20:33:05.2184129Z ) -> None: 2025-05-07T20:33:05.2184460Z torch.manual_seed(2025) 2025-05-07T20:33:05.2184815Z 2025-05-07T20:33:05.2185188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2185676Z 2025-05-07T20:33:05.2185951Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2186358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2186787Z x = x_sign * x_clamp 2025-05-07T20:33:05.2187120Z x0 = x[:, :D] 2025-05-07T20:33:05.2187411Z x1 = x[:, D:] 2025-05-07T20:33:05.2187696Z 2025-05-07T20:33:05.2187948Z if contiguous: 2025-05-07T20:33:05.2188257Z x0 = x0.contiguous() 2025-05-07T20:33:05.2188607Z x1 = x1.contiguous() 2025-05-07T20:33:05.2188940Z 2025-05-07T20:33:05.2189196Z if scale_ub is not None: 2025-05-07T20:33:05.2189580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2190036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2190476Z ) 2025-05-07T20:33:05.2190751Z else: 2025-05-07T20:33:05.2191059Z scale_ub_tensor = None 2025-05-07T20:33:05.2191423Z 2025-05-07T20:33:05.2191740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2192192Z op = silu_mul_quant 2025-05-07T20:33:05.2192539Z if compiled: 2025-05-07T20:33:05.2192879Z op = torch.compile(op) 2025-05-07T20:33:05.2193294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2193671Z 2025-05-07T20:33:05.2193943Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:05.2194398Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2194806Z 2025-05-07T20:33:05.2195129Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2195607Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2196039Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2196491Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2197019Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2197472Z 2025-05-07T20:33:05.2197756Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.2198023Z 2025-05-07T20:33:05.2198162Z moe/activation_test.py:126: 2025-05-07T20:33:05.2198582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2199041Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.2199484Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2200726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.2201884Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2202752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2203734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2204731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:05.2205756Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2206819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:05.2207734Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2208571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:05.2209292Z fn() 2025-05-07T20:33:05.2210042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:05.2210909Z self.fn.run( 2025-05-07T20:33:05.2211595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2212369Z kernel = self.compile( 2025-05-07T20:33:05.2213157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2214398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2214971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2215290Z 2025-05-07T20:33:05.2215570Z self = 2025-05-07T20:33:05.2217057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2218971Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7ff7a7cae700>} 2025-05-07T20:33:05.2220821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2222213Z context = 2025-05-07T20:33:05.2222600Z 2025-05-07T20:33:05.2222827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2223546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2224205Z module_map=module_map) 2025-05-07T20:33:05.2224720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2225218Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2225596Z E ^ 2025-05-07T20:33:05.2226268Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2226943Z 2025-05-07T20:33:05.2227558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2228314Z 2025-05-07T20:33:05.2228458Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2229050Z self=, 2025-05-07T20:33:05.2229630Z T=2048, 2025-05-07T20:33:05.2229884Z D=5120, 2025-05-07T20:33:05.2230148Z scale_ub=1200.0, 2025-05-07T20:33:05.2230450Z contiguous=True, 2025-05-07T20:33:05.2230862Z compiled=False, 2025-05-07T20:33:05.2231145Z ) 2025-05-07T20:33:05.2231585Z self = 2025-05-07T20:33:05.2232335Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.2232723Z 2025-05-07T20:33:05.2232829Z @given( 2025-05-07T20:33:05.2233147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2233569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2233992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2234474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2234964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2235379Z ) 2025-05-07T20:33:05.2235895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2236540Z def test_silu_mul_quant( 2025-05-07T20:33:05.2236888Z self, 2025-05-07T20:33:05.2237176Z T: int, 2025-05-07T20:33:05.2237467Z D: int, 2025-05-07T20:33:05.2237779Z scale_ub: Optional[float], 2025-05-07T20:33:05.2238252Z contiguous: bool, 2025-05-07T20:33:05.2238589Z compiled: bool, 2025-05-07T20:33:05.2238892Z ) -> None: 2025-05-07T20:33:05.2239263Z torch.manual_seed(2025) 2025-05-07T20:33:05.2239601Z 2025-05-07T20:33:05.2239962Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2240556Z 2025-05-07T20:33:05.2240823Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2241219Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2241633Z x = x_sign * x_clamp 2025-05-07T20:33:05.2241966Z x0 = x[:, :D] 2025-05-07T20:33:05.2242271Z x1 = x[:, D:] 2025-05-07T20:33:05.2242548Z 2025-05-07T20:33:05.2242801Z if contiguous: 2025-05-07T20:33:05.2243116Z x0 = x0.contiguous() 2025-05-07T20:33:05.2243477Z x1 = x1.contiguous() 2025-05-07T20:33:05.2243808Z 2025-05-07T20:33:05.2244083Z if scale_ub is not None: 2025-05-07T20:33:05.2244490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2245387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2245812Z ) 2025-05-07T20:33:05.2246085Z else: 2025-05-07T20:33:05.2246396Z scale_ub_tensor = None 2025-05-07T20:33:05.2246750Z 2025-05-07T20:33:05.2247066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2247499Z op = silu_mul_quant 2025-05-07T20:33:05.2268875Z if compiled: 
2025-05-07T20:33:05.2269227Z op = torch.compile(op) 2025-05-07T20:33:05.2269633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2269983Z 2025-05-07T20:33:05.2270227Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2270453Z 2025-05-07T20:33:05.2270579Z moe/activation_test.py:117: 2025-05-07T20:33:05.2270950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2271376Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2271728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2272673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2273681Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2274520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2275504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2276406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2277146Z kernel = self.compile( 2025-05-07T20:33:05.2277932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2279014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2279578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2279919Z 2025-05-07T20:33:05.2280406Z self = 2025-05-07T20:33:05.2281954Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2283965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a7b62020>} 2025-05-07T20:33:05.2285862Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2286938Z context = 2025-05-07T20:33:05.2287307Z 2025-05-07T20:33:05.2287485Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2288076Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2288557Z module_map=module_map) 2025-05-07T20:33:05.2288935Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2289302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2289564Z E ^ 2025-05-07T20:33:05.2290047Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2290511Z 2025-05-07T20:33:05.2290946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2291475Z 2025-05-07T20:33:05.2291588Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2292003Z self=, 2025-05-07T20:33:05.2292413Z T=2048, 2025-05-07T20:33:05.2292604Z D=5120, 2025-05-07T20:33:05.2292795Z scale_ub=1200.0, 2025-05-07T20:33:05.2293016Z contiguous=True, 2025-05-07T20:33:05.2293235Z compiled=True, 2025-05-07T20:33:05.2293434Z ) 2025-05-07T20:33:05.2293758Z self = 2025-05-07T20:33:05.2294323Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.2294598Z 2025-05-07T20:33:05.2294683Z @given( 2025-05-07T20:33:05.2294908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2295228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2295543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2295875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2296216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2296508Z ) 2025-05-07T20:33:05.2296861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2297322Z def test_silu_mul_quant( 2025-05-07T20:33:05.2297576Z self, 2025-05-07T20:33:05.2297767Z T: int, 2025-05-07T20:33:05.2297970Z D: int, 2025-05-07T20:33:05.2298193Z scale_ub: Optional[float], 2025-05-07T20:33:05.2298467Z contiguous: bool, 2025-05-07T20:33:05.2298711Z compiled: bool, 2025-05-07T20:33:05.2298937Z ) -> None: 2025-05-07T20:33:05.2299159Z torch.manual_seed(2025) 2025-05-07T20:33:05.2299398Z 2025-05-07T20:33:05.2299677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2300028Z 2025-05-07T20:33:05.2300219Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2300520Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2300842Z x = x_sign * x_clamp 2025-05-07T20:33:05.2301136Z x0 = x[:, :D] 2025-05-07T20:33:05.2301358Z x1 = x[:, D:] 2025-05-07T20:33:05.2301573Z 2025-05-07T20:33:05.2301757Z if contiguous: 2025-05-07T20:33:05.2302039Z x0 = x0.contiguous() 2025-05-07T20:33:05.2302313Z x1 = x1.contiguous() 2025-05-07T20:33:05.2302557Z 2025-05-07T20:33:05.2302756Z if scale_ub is not None: 2025-05-07T20:33:05.2303040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2303383Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2303705Z ) 2025-05-07T20:33:05.2303902Z else: 2025-05-07T20:33:05.2304106Z scale_ub_tensor = None 2025-05-07T20:33:05.2304389Z 2025-05-07T20:33:05.2304651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2304978Z op = silu_mul_quant 2025-05-07T20:33:05.2305234Z if compiled: 2025-05-07T20:33:05.2305489Z op = torch.compile(op) 2025-05-07T20:33:05.2305798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2306126Z 2025-05-07T20:33:05.2306324Z y_fp8, y_scale = fn() 2025-05-07T20:33:05.2306624Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2306958Z 2025-05-07T20:33:05.2307204Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2307552Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2307846Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2308172Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2308544Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2308864Z 2025-05-07T20:33:05.2309061Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7ff7a6c44400>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
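Every example in this block dies on the same Triton error: both kernels request the fp8e4nv (FP8 E4M3) element type, and the compiler reports that only 'fp8e4b15' and 'fp8e5' are available on this GPU's architecture. A minimal capability guard is sketched below; it assumes the usual mapping of Triton's fp8e4nv to NVIDIA compute capability 8.9 or newer, and the helper name is illustrative rather than part of the test file.

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (e4m3) only on NVIDIA GPUs with
        # compute capability >= 8.9; older parts raise the ValueError that
        # repeats throughout this log.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

Wrapped in something like unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"), the examples below would be reported as skips rather than hard errors on this hardware.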
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

[test body identical to the listing above, through the definition of fn()]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[then the same autotuner/compile frames as in the first traceback above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Note the pattern: with compiled=False the test fails inside fn() at _fbgemm_silu_mul_quant, while with compiled=True the call under test returns and the failure instead surfaces in the eager reference path, inside triton_quantize_fp8_row's _kernel_quantize_fp8_row. The root error is identical in both paths.
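The reference path makes the quantization contract explicit: ref_fn computes the SiLU product y = x0 * sigmoid(x0) * x1 in fp32 and then row-quantizes it so that y_fp8.to(torch.float32) * y_scale[:, None] approximately reconstructs y. A rough pure-PyTorch equivalent of that contract is sketched below, assuming an e4m3 target and scale_ub acting as a clamp on the per-row maximum; FBGEMM's Triton kernel remains the authority on the details.

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so dequant is y_fp8.float() * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            # Assumption: scale_ub caps the per-row max before scaling.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        y_scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / y_scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), y_scale.squeeze(-1)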
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test body identical to the first listing; fails at fn(), moe/activation_test.py:117, with the same eager traceback through silu_mul_quant]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
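The _fbgemm_silu_mul_quant[grid](...) frame in these tracebacks is Triton's launch syntax: subscripting a @triton.jit function with a grid produces a launcher, and compilation happens lazily at the first call, which is why the architecture check only fires while the test runs rather than at import time. A minimal illustration of the same launch pattern (a toy copy kernel, unrelated to FBGEMM's kernels):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # Each program instance copies one BLOCK-sized slice of x into y.
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 256),)
    _copy_kernel[grid](x, y, x.numel(), BLOCK=256)  # compiles here, on first launch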
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[fails at fn(): same CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
[fn() returns; fails at ref_fn(), moe/activation_test.py:126: same CompilationError in _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[fails at fn(): same CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[fails at fn(): same CompilationError in _fbgemm_silu_mul_quant]
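The "Trying example:" lines come from Hypothesis's Verbosity.verbose setting in the decorators above: each draw from the sampled_from grids is printed before it runs, which is why the identical test body repeats for every parameter combination. A standalone toy showing the same mechanics (not the FBGEMM test itself):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def check_grid(T: int, D: int) -> None:
        # Trivial property; running this prints "Trying example: check_grid(...)"
        # lines analogous to the ones in this log.
        assert T > 0 and D > 0

    check_grid()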
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[fn() returns; fails at ref_fn(): same CompilationError in _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[fn() returns; fails at ref_fn(): same CompilationError in _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[fn() returns; fails at ref_fn(): same CompilationError in _kernel_quantize_fp8_row]
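Since every parameter combination fails the same way, the simplest repro does not need Hypothesis at all. A minimal eager-mode script is sketched below, with the import location taken from the tracebacks above; it assumes a CUDA build of fbgemm_gpu's experimental GenAI package is installed.

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On a GPU without fp8e4nv support this raises the CompilationError seen
    # above; on a supported GPU it returns the fp8 tensor and per-row scales.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)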
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2540289Z 2025-05-07T20:33:05.2540715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2540719Z 2025-05-07T20:33:05.2540831Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2541060Z self=, 2025-05-07T20:33:05.2541144Z T=4096, 2025-05-07T20:33:05.2541223Z D=5120, 2025-05-07T20:33:05.2541309Z scale_ub=None, 2025-05-07T20:33:05.2541401Z contiguous=True, 2025-05-07T20:33:05.2541487Z compiled=True, 2025-05-07T20:33:05.2541605Z ) 2025-05-07T20:33:05.2541835Z self = 2025-05-07T20:33:05.2542075Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2542080Z 2025-05-07T20:33:05.2542160Z @given( 2025-05-07T20:33:05.2542286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2542387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2542505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2542630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2542747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2542829Z ) 2025-05-07T20:33:05.2543082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2543179Z def test_silu_mul_quant( 2025-05-07T20:33:05.2543264Z self, 2025-05-07T20:33:05.2543341Z T: int, 2025-05-07T20:33:05.2543422Z D: int, 2025-05-07T20:33:05.2543528Z scale_ub: Optional[float], 2025-05-07T20:33:05.2543621Z contiguous: bool, 2025-05-07T20:33:05.2543712Z compiled: bool, 2025-05-07T20:33:05.2543800Z ) -> None: 2025-05-07T20:33:05.2543897Z torch.manual_seed(2025) 2025-05-07T20:33:05.2543971Z 2025-05-07T20:33:05.2544157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2544232Z 2025-05-07T20:33:05.2544356Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2544501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2544598Z x = x_sign * x_clamp 2025-05-07T20:33:05.2544687Z x0 = x[:, :D] 2025-05-07T20:33:05.2544768Z x1 = x[:, D:] 2025-05-07T20:33:05.2544841Z 2025-05-07T20:33:05.2544933Z if contiguous: 2025-05-07T20:33:05.2545025Z x0 = x0.contiguous() 2025-05-07T20:33:05.2545118Z x1 = x1.contiguous() 2025-05-07T20:33:05.2545198Z 2025-05-07T20:33:05.2545294Z if scale_ub is not None: 2025-05-07T20:33:05.2545405Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2545551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2545627Z ) 2025-05-07T20:33:05.2545711Z else: 2025-05-07T20:33:05.2545807Z scale_ub_tensor = None 2025-05-07T20:33:05.2545881Z 2025-05-07T20:33:05.2546022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2546114Z op = silu_mul_quant 2025-05-07T20:33:05.2546199Z if compiled: 2025-05-07T20:33:05.2546307Z op = torch.compile(op) 2025-05-07T20:33:05.2546415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2546489Z 2025-05-07T20:33:05.2546591Z y_fp8, y_scale = fn() 2025-05-07T20:33:05.2546715Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2546841Z 2025-05-07T20:33:05.2546986Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2547092Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2547248Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2547374Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2547517Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2547597Z 2025-05-07T20:33:05.2547700Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:05.2547704Z 2025-05-07T20:33:05.2547806Z moe/activation_test.py:126: 2025-05-07T20:33:05.2547944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2548053Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.2548191Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2548769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.2548877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2549348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2549579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2549955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:05.2550227Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2550612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:05.2550790Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2551140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:05.2551221Z fn() 2025-05-07T20:33:05.2551639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:05.2551729Z self.fn.run( 2025-05-07T20:33:05.2552080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2552181Z kernel = self.compile( 2025-05-07T20:33:05.2552574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2552758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2552887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2552892Z 2025-05-07T20:33:05.2553103Z self = 2025-05-07T20:33:05.2553910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2554481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7815ea660>} 2025-05-07T20:33:05.2555267Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2555463Z context = 2025-05-07T20:33:05.2555467Z 2025-05-07T20:33:05.2555643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2555915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2556025Z module_map=module_map) 2025-05-07T20:33:05.2556243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2556352Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2556470Z E ^ 2025-05-07T20:33:05.2556846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2556850Z 2025-05-07T20:33:05.2557277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2557282Z 2025-05-07T20:33:05.2557394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2557626Z self=, 2025-05-07T20:33:05.2557707Z T=16384, 2025-05-07T20:33:05.2557790Z D=5120, 2025-05-07T20:33:05.2557875Z scale_ub=None, 2025-05-07T20:33:05.2557961Z contiguous=True, 2025-05-07T20:33:05.2558055Z compiled=True, 2025-05-07T20:33:05.2558133Z ) 2025-05-07T20:33:05.2558357Z self = 2025-05-07T20:33:05.2558584Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2558592Z 2025-05-07T20:33:05.2558708Z @given( 2025-05-07T20:33:05.2558836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2558942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2559059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2559185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2559302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2559377Z ) 2025-05-07T20:33:05.2559639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2559734Z def test_silu_mul_quant( 2025-05-07T20:33:05.2559817Z self, 2025-05-07T20:33:05.2559894Z T: int, 2025-05-07T20:33:05.2559971Z D: int, 2025-05-07T20:33:05.2560140Z scale_ub: Optional[float], 2025-05-07T20:33:05.2560232Z contiguous: bool, 2025-05-07T20:33:05.2560322Z compiled: bool, 2025-05-07T20:33:05.2560408Z ) -> None: 2025-05-07T20:33:05.2560520Z torch.manual_seed(2025) 2025-05-07T20:33:05.2560598Z 2025-05-07T20:33:05.2571890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2571984Z 2025-05-07T20:33:05.2572086Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2572235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2572330Z x = x_sign * x_clamp 2025-05-07T20:33:05.2572415Z x0 = x[:, :D] 2025-05-07T20:33:05.2572505Z x1 = x[:, D:] 2025-05-07T20:33:05.2572579Z 2025-05-07T20:33:05.2572666Z if contiguous: 2025-05-07T20:33:05.2572769Z x0 = x0.contiguous() 2025-05-07T20:33:05.2572861Z x1 = x1.contiguous() 2025-05-07T20:33:05.2572944Z 2025-05-07T20:33:05.2573039Z if scale_ub is not None: 2025-05-07T20:33:05.2573159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2573313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2573395Z ) 2025-05-07T20:33:05.2573474Z else: 2025-05-07T20:33:05.2573585Z scale_ub_tensor = None 2025-05-07T20:33:05.2573661Z 2025-05-07T20:33:05.2573799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2573904Z op = silu_mul_quant 2025-05-07T20:33:05.2573992Z if compiled: 2025-05-07T20:33:05.2574099Z op = torch.compile(op) 2025-05-07T20:33:05.2574223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2574297Z 2025-05-07T20:33:05.2574399Z y_fp8, y_scale = fn() 2025-05-07T20:33:05.2574526Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2574600Z 2025-05-07T20:33:05.2574747Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2574942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2575047Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2575185Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2575380Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2575456Z 2025-05-07T20:33:05.2575568Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.2575574Z 2025-05-07T20:33:05.2575676Z moe/activation_test.py:126: 2025-05-07T20:33:05.2575820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2575931Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.2576071Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2576663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.2576769Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2577150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2577441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2577864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:05.2578142Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2578534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:05.2578710Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2579075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:05.2579156Z fn() 2025-05-07T20:33:05.2579580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:05.2579669Z self.fn.run( 2025-05-07T20:33:05.2580031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2580140Z kernel = self.compile( 2025-05-07T20:33:05.2580539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2580723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2580867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2580872Z 2025-05-07T20:33:05.2581086Z self = 2025-05-07T20:33:05.2581902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2582432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780b25800>} 2025-05-07T20:33:05.2583220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2583422Z context = 2025-05-07T20:33:05.2583427Z 2025-05-07T20:33:05.2583602Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2583891Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2584004Z module_map=module_map) 2025-05-07T20:33:05.2584173Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2584288Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2584416Z E ^ 2025-05-07T20:33:05.2584837Z E ValueError("type fp8e4nv not supported in this architecture. 
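Every example above dies while Triton compiles FBGEMM's row-wise fp8 quantization kernel, not in the test logic itself. For orientation, the math that triton_quantize_fp8_row performs can be sketched in plain PyTorch. This is an illustrative approximation only: the FP8_E4M3_MAX constant and the clamp semantics of scale_ub are assumptions inferred from the dequant check in the test (y = y_fp8.to(torch.float32) * y_scale[:, None]), not taken from FBGEMM's source.

import torch

FP8_E4M3_MAX = 448.0  # assumed max representable magnitude for torch.float8_e4m3fn

def quantize_fp8_row_sketch(y, scale_ub=None):
    # Per-row scale chosen so that y / scale fits the fp8 range; the
    # inverse (dequant) is y_fp8.to(torch.float32) * y_scale[:, None],
    # matching the check in test_silu_mul_quant.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # assumed semantics: scale_ub caps the per-row max before scaling
        row_max = torch.minimum(row_max, scale_ub.to(torch.float32))
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale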
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _kernel_quantize_fp8_row via ref_fn (trace identical to the T=128 example above)
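Note that the failure does not depend on FBGEMM at all: any Triton kernel that casts to tl.float8e4nv should fail the same way on this GPU. A minimal standalone repro sketch (the kernel and variable names here are hypothetical, chosen for illustration):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # On GPUs without native fp8e4nv support, this cast is what makes
    # compilation raise ValueError("type fp8e4nv not supported in this
    # architecture. ...") as seen throughout this log.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

N = 1024
x = torch.randn(N, device="cuda", dtype=torch.float32)
y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
_fp8_cast_kernel[(triton.cdiv(N, 1024),)](x, y, N, BLOCK=1024)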
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant; the eager trace is identical except that the torch/_dynamo/eval_frame.py frame is absent (moe/activation_test.py:115 calls activation.py:80 directly)

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant (trace identical to the T=1, scale_ub=1200.0 compiled example above)
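Since every generated example fails with the same architecture error, the fp8 tests could be skipped up front on unsupported hardware instead of failing example after example. A minimal sketch of such a guard, assuming Triton only lowers tl.float8e4nv on compute capability 8.9 and newer (the helper and class names are hypothetical):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) maps to native FP8 only on compute capability >= (8, 9)
    # (Ada / Hopper); on older GPUs Triton raises the ValueError in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardExample(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    def test_fp8_path_runs_only_on_supported_gpus(self) -> None:
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

if __name__ == "__main__":
    unittest.main()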
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2643146Z 2025-05-07T20:33:05.2643574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2643587Z 2025-05-07T20:33:05.2643692Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2643922Z self=, 2025-05-07T20:33:05.2644006Z T=128, 2025-05-07T20:33:05.2644084Z D=7168, 2025-05-07T20:33:05.2644169Z scale_ub=1200.0, 2025-05-07T20:33:05.2644263Z contiguous=False, 2025-05-07T20:33:05.2644398Z compiled=False, 2025-05-07T20:33:05.2644472Z ) 2025-05-07T20:33:05.2644702Z self = 2025-05-07T20:33:05.2644929Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.2644934Z 2025-05-07T20:33:05.2645019Z @given( 2025-05-07T20:33:05.2645143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2645244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2645368Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2645489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2645605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2645687Z ) 2025-05-07T20:33:05.2645941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2646036Z def test_silu_mul_quant( 2025-05-07T20:33:05.2646122Z self, 2025-05-07T20:33:05.2646203Z T: int, 2025-05-07T20:33:05.2646282Z D: int, 2025-05-07T20:33:05.2646389Z scale_ub: Optional[float], 2025-05-07T20:33:05.2646529Z contiguous: bool, 2025-05-07T20:33:05.2646625Z compiled: bool, 2025-05-07T20:33:05.2646743Z ) -> None: 2025-05-07T20:33:05.2646844Z torch.manual_seed(2025) 2025-05-07T20:33:05.2646924Z 2025-05-07T20:33:05.2647100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2647177Z 2025-05-07T20:33:05.2647278Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2647405Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2647497Z x = x_sign * x_clamp 2025-05-07T20:33:05.2647584Z x0 = x[:, :D] 2025-05-07T20:33:05.2647668Z x1 = x[:, D:] 2025-05-07T20:33:05.2647744Z 2025-05-07T20:33:05.2647836Z if contiguous: 2025-05-07T20:33:05.2647929Z x0 = x0.contiguous() 2025-05-07T20:33:05.2648028Z x1 = x1.contiguous() 2025-05-07T20:33:05.2648104Z 2025-05-07T20:33:05.2648196Z if scale_ub is not None: 2025-05-07T20:33:05.2648314Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2648456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2648536Z ) 2025-05-07T20:33:05.2648620Z else: 2025-05-07T20:33:05.2648717Z scale_ub_tensor = None 2025-05-07T20:33:05.2648792Z 2025-05-07T20:33:05.2648932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2649025Z op = silu_mul_quant 2025-05-07T20:33:05.2649113Z if compiled: 2025-05-07T20:33:05.2649226Z op = torch.compile(op) 2025-05-07T20:33:05.2649334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2649410Z 2025-05-07T20:33:05.2649510Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2649514Z 2025-05-07T20:33:05.2649614Z moe/activation_test.py:117: 2025-05-07T20:33:05.2649752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2649859Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2649965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2650493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2650595Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2650967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2651206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2651559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2651661Z kernel = self.compile( 2025-05-07T20:33:05.2652057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2652314Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2652453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2652496Z 2025-05-07T20:33:05.2652711Z self = 2025-05-07T20:33:05.2653525Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2654049Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780720400>} 2025-05-07T20:33:05.2654830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2655031Z context = 2025-05-07T20:33:05.2655075Z 2025-05-07T20:33:05.2655249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2655570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2655684Z module_map=module_map) 2025-05-07T20:33:05.2655849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2655958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2656038Z E ^ 2025-05-07T20:33:05.2656413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2656418Z 2025-05-07T20:33:05.2656846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2656851Z 2025-05-07T20:33:05.2656960Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2657198Z self=, 2025-05-07T20:33:05.2657281Z T=128, 2025-05-07T20:33:05.2657373Z D=5120, 2025-05-07T20:33:05.2657462Z scale_ub=None, 2025-05-07T20:33:05.2657551Z contiguous=False, 2025-05-07T20:33:05.2657645Z compiled=False, 2025-05-07T20:33:05.2657720Z ) 2025-05-07T20:33:05.2657948Z self = 2025-05-07T20:33:05.2658131Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.2658136Z 2025-05-07T20:33:05.2658216Z @given( 2025-05-07T20:33:05.2658338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2658446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2658563Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2658689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2658807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2658884Z ) 2025-05-07T20:33:05.2659150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2659248Z def test_silu_mul_quant( 2025-05-07T20:33:05.2659325Z self, 2025-05-07T20:33:05.2659408Z T: int, 2025-05-07T20:33:05.2659485Z D: int, 2025-05-07T20:33:05.2659587Z scale_ub: Optional[float], 2025-05-07T20:33:05.2659685Z contiguous: bool, 2025-05-07T20:33:05.2659773Z compiled: bool, 2025-05-07T20:33:05.2659853Z ) -> None: 2025-05-07T20:33:05.2659955Z torch.manual_seed(2025) 2025-05-07T20:33:05.2660029Z 2025-05-07T20:33:05.2660207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2660288Z 2025-05-07T20:33:05.2660382Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2660513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2660652Z x = x_sign * x_clamp 2025-05-07T20:33:05.2660735Z x0 = x[:, :D] 2025-05-07T20:33:05.2660829Z x1 = x[:, D:] 2025-05-07T20:33:05.2660903Z 2025-05-07T20:33:05.2661029Z if contiguous: 2025-05-07T20:33:05.2661133Z x0 = x0.contiguous() 2025-05-07T20:33:05.2661224Z x1 = x1.contiguous() 2025-05-07T20:33:05.2661298Z 2025-05-07T20:33:05.2661397Z if scale_ub is not None: 2025-05-07T20:33:05.2661505Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2661644Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2661727Z ) 2025-05-07T20:33:05.2661805Z else: 2025-05-07T20:33:05.2661907Z scale_ub_tensor = None 2025-05-07T20:33:05.2661980Z 2025-05-07T20:33:05.2662112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2662213Z op = silu_mul_quant 2025-05-07T20:33:05.2662299Z if compiled: 2025-05-07T20:33:05.2662403Z op = torch.compile(op) 2025-05-07T20:33:05.2662516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2662633Z 2025-05-07T20:33:05.2662726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2662733Z 2025-05-07T20:33:05.2662877Z moe/activation_test.py:117: 2025-05-07T20:33:05.2663011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2663119Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2663221Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2663736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2663844Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2664218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2664451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2664818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2664917Z kernel = self.compile( 2025-05-07T20:33:05.2665325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2665507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2665637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2665641Z 2025-05-07T20:33:05.2665857Z self = 2025-05-07T20:33:05.2666662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2667192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781029f80>} 2025-05-07T20:33:05.2667971Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2668170Z context = 2025-05-07T20:33:05.2668181Z 2025-05-07T20:33:05.2668355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2668628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2668745Z module_map=module_map) 2025-05-07T20:33:05.2668912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2669013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2669098Z E ^ 2025-05-07T20:33:05.2669511Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2669518Z 2025-05-07T20:33:05.2669997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2670002Z 2025-05-07T20:33:05.2670111Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2670341Z self=, 2025-05-07T20:33:05.2670427Z T=128, 2025-05-07T20:33:05.2670506Z D=5120, 2025-05-07T20:33:05.2670592Z scale_ub=1200.0, 2025-05-07T20:33:05.2670688Z contiguous=True, 2025-05-07T20:33:05.2670774Z compiled=False, 2025-05-07T20:33:05.2670850Z ) 2025-05-07T20:33:05.2671083Z self = 2025-05-07T20:33:05.2671260Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.2671264Z 2025-05-07T20:33:05.2671353Z @given( 2025-05-07T20:33:05.2671478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2671630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2671759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2671919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2672038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2672123Z ) 2025-05-07T20:33:05.2672378Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2672474Z def test_silu_mul_quant( 2025-05-07T20:33:05.2672560Z self, 2025-05-07T20:33:05.2672637Z T: int, 2025-05-07T20:33:05.2672723Z D: int, 2025-05-07T20:33:05.2672824Z scale_ub: Optional[float], 2025-05-07T20:33:05.2672916Z contiguous: bool, 2025-05-07T20:33:05.2673011Z compiled: bool, 2025-05-07T20:33:05.2673091Z ) -> None: 2025-05-07T20:33:05.2673188Z torch.manual_seed(2025) 2025-05-07T20:33:05.2673269Z 2025-05-07T20:33:05.2673444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2673524Z 2025-05-07T20:33:05.2673628Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2673757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2673848Z x = x_sign * x_clamp 2025-05-07T20:33:05.2673936Z x0 = x[:, :D] 2025-05-07T20:33:05.2674017Z x1 = x[:, D:] 2025-05-07T20:33:05.2674096Z 2025-05-07T20:33:05.2674182Z if contiguous: 2025-05-07T20:33:05.2674275Z x0 = x0.contiguous() 2025-05-07T20:33:05.2674372Z x1 = x1.contiguous() 2025-05-07T20:33:05.2674446Z 2025-05-07T20:33:05.2674538Z if scale_ub is not None: 2025-05-07T20:33:05.2674651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2674791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2674867Z ) 2025-05-07T20:33:05.2674954Z else: 2025-05-07T20:33:05.2675053Z scale_ub_tensor = None 2025-05-07T20:33:05.2675127Z 2025-05-07T20:33:05.2675274Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2675370Z op = silu_mul_quant 2025-05-07T20:33:05.2675461Z if compiled: 2025-05-07T20:33:05.2675569Z op = torch.compile(op) 2025-05-07T20:33:05.2675676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2675757Z 2025-05-07T20:33:05.2675848Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2675853Z 2025-05-07T20:33:05.2675951Z moe/activation_test.py:117: 2025-05-07T20:33:05.2676088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2676191Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2676292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2676817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2676968Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7ff780518c20>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis re-prints the identical test source and an identical traceback for each example that follows; the duplicates below are collapsed to the parameters tried and the point of failure.]
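Every failure in this run bottoms out in the same ValueError from Triton's IR builder: fp8e4nv is Triton's name for the FP8 E4M3 format (PyTorch's torch.float8_e4m3fn), and Triton only accepts it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older silicon, for example the A10G (sm_86) in g5-class runners, only fp8e4b15 and fp8e5 are available, which is exactly what the message reports. A probe along these lines, a sketch rather than anything from the test suite, shows whether a given machine can compile such kernels:

    # Probe sketch (not from the test suite): check whether this GPU can
    # compile Triton kernels that use fp8e4nv (torch.float8_e4m3fn).
    import torch

    def fp8_e4m3_supported() -> bool:
        # Triton accepts fp8e4nv only on NVIDIA compute capability >= (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if __name__ == "__main__":
        print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability())
        print("fp8e4nv supported:", fp8_e4m3_supported())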
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> on this re-run fn() returns, and the failure moves into the test's reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
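Notably, the reference path is not a Triton-free fallback: triton_quantize_fp8_row launches its own FP8 Triton kernel (_kernel_quantize_fp8_row), so both sides of the comparison need fp8e4nv support. The test only relies on the rowwise dequantization identity y ~= y_fp8.float() * y_scale[:, None]; a plain-PyTorch sketch of that contract (an illustration assuming these semantics, not FBGEMM's actual triton_quantize_fp8_row) could look like:

    # Rowwise FP8 quantization sketch; semantics inferred from the test's
    # dequantization step. Details (eps, scale_ub handling) are assumptions.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=-1).float()  # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale
        y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The final cast goes through PyTorch's own conversion kernels rather than Triton, so a sketch like this should run even on GPUs where the kernels above fail to compile.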
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant
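With verbosity=Verbosity.verbose and no hardware gate, Hypothesis keeps generating examples and every one of them hits the same compile error. One conventional way to fail fast on unsupported GPUs is a capability-based skip; the snippet below is a sketch reusing the probe above, and the class name is hypothetical (the real class in moe/activation_test.py is not visible in this log):

    # Skip sketch: gate FP8 tests on compute capability so unsupported
    # runners skip instead of exhausting Hypothesis examples.
    import unittest
    import torch

    def fp8_e4m3_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(fp8_e4m3_supported(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical class name
        ...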
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:05.2853535Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2853910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2854162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2854558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2854654Z kernel = self.compile( 2025-05-07T20:33:05.2855057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2855237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2855367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2855371Z 2025-05-07T20:33:05.2855589Z self = 2025-05-07T20:33:05.2856390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2857009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7806796c0>} 2025-05-07T20:33:05.2857784Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2857984Z context = 2025-05-07T20:33:05.2857988Z 2025-05-07T20:33:05.2858165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2858441Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2858561Z module_map=module_map) 2025-05-07T20:33:05.2858725Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2858869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2858957Z E ^ 2025-05-07T20:33:05.2859360Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2859365Z 2025-05-07T20:33:05.2859802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2859807Z 2025-05-07T20:33:05.2859915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2860145Z self=, 2025-05-07T20:33:05.2860232Z T=4096, 2025-05-07T20:33:05.2860310Z D=5120, 2025-05-07T20:33:05.2860397Z scale_ub=1200.0, 2025-05-07T20:33:05.2860493Z contiguous=False, 2025-05-07T20:33:05.2860581Z compiled=True, 2025-05-07T20:33:05.2860658Z ) 2025-05-07T20:33:05.2860889Z self = 2025-05-07T20:33:05.2861074Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.2861078Z 2025-05-07T20:33:05.2861165Z @given( 2025-05-07T20:33:05.2861287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2861388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2861512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2861632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2861747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2861830Z ) 2025-05-07T20:33:05.2862083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2862181Z def test_silu_mul_quant( 2025-05-07T20:33:05.2862267Z self, 2025-05-07T20:33:05.2862344Z T: int, 2025-05-07T20:33:05.2862427Z D: int, 2025-05-07T20:33:05.2862531Z scale_ub: Optional[float], 2025-05-07T20:33:05.2862623Z contiguous: bool, 2025-05-07T20:33:05.2862721Z compiled: bool, 2025-05-07T20:33:05.2862801Z ) -> None: 2025-05-07T20:33:05.2862903Z torch.manual_seed(2025) 2025-05-07T20:33:05.2862985Z 2025-05-07T20:33:05.2863162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2863238Z 2025-05-07T20:33:05.2863341Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2863470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2863564Z x = x_sign * x_clamp 2025-05-07T20:33:05.2863653Z x0 = x[:, :D] 2025-05-07T20:33:05.2863735Z x1 = x[:, D:] 2025-05-07T20:33:05.2863811Z 2025-05-07T20:33:05.2863905Z if contiguous: 2025-05-07T20:33:05.2863999Z x0 = x0.contiguous() 2025-05-07T20:33:05.2864099Z x1 = x1.contiguous() 2025-05-07T20:33:05.2864178Z 2025-05-07T20:33:05.2864324Z if scale_ub is not None: 2025-05-07T20:33:05.2864464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2864622Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2864746Z ) 2025-05-07T20:33:05.2864838Z else: 2025-05-07T20:33:05.2864936Z scale_ub_tensor = None 2025-05-07T20:33:05.2865010Z 2025-05-07T20:33:05.2865153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2865245Z op = silu_mul_quant 2025-05-07T20:33:05.2865333Z if compiled: 2025-05-07T20:33:05.2865443Z op = torch.compile(op) 2025-05-07T20:33:05.2865550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2865632Z 2025-05-07T20:33:05.2865726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2865730Z 2025-05-07T20:33:05.2865834Z moe/activation_test.py:117: 2025-05-07T20:33:05.2865974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2866079Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2866181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2866614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.2866748Z return fn(*args, **kwargs) 
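Every failure in this transcript is the same compilation error: Triton's fp8e4nv is the e4m3 FP8 format backed by hardware FP8 instructions, which NVIDIA introduced with compute capability 8.9 (Ada) and 9.0 (Hopper). On older GPUs Triton offers only the software variants named in the message, ('fp8e4b15', 'fp8e5'), and raises this ValueError at kernel-compile time. Below is a minimal sketch, not part of FBGEMM or this test suite, of a pre-flight check that predicts whether an fp8e4nv kernel can compile on the current device; the helper name supports_fp8e4nv is an assumption for illustration.

# Minimal sketch (assumption: not FBGEMM code) of a pre-flight check for
# whether Triton can be expected to compile fp8e4nv (e4m3) kernels here.
# Hardware FP8 requires compute capability >= 8.9; older GPUs trigger the
# ValueError captured in this log.
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns (major, minor), e.g. (8, 6) for A10G.
    return torch.cuda.get_device_capability() >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv expected to compile:", supports_fp8e4nv())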
Further examples, all failing identically (a reference sketch of the op under test follows this list):

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
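For orientation, the op under test returns an FP8 tensor plus a scale (y_fp8, y_scale in the source listing above). The following is a rough eager-mode sketch inferred from the test, not FBGEMM's actual kernel, whose scale granularity and scale_ub handling may well differ: y = silu(x0) * x1, dynamically quantized to e4m3 FP8 with a per-row scale.

# Rough eager-mode reference (an inference from the test above, not FBGEMM's
# kernel): silu(x0) * x1, then per-row dynamic quantization to e4m3 FP8,
# with the row maximum optionally clamped by scale_ub.
from typing import Optional, Tuple
import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale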
Further examples, all failing identically (a capability-based skip guard is sketched after this list):

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
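One conventional way to keep property-based FP8 tests green on pre-FP8 hardware is a capability-based skip. A hedged sketch follows; the guard name and its placement are assumptions for illustration, not the mechanism activation_test.py actually uses.

# Hedged sketch of a capability gate for FP8 tests; the HAS_HW_FP8 flag and
# test name below are illustrative, not code from activation_test.py.
import pytest
import torch

HAS_HW_FP8 = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@pytest.mark.skipif(
    not HAS_HW_FP8,
    reason="Triton fp8e4nv (e4m3) needs SM 8.9+; this GPU only offers ('fp8e4b15', 'fp8e5')",
)
def test_silu_mul_quant_fp8() -> None:
    ...  # property-based body as in the original test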
Further examples, all failing identically:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
[source listing and traceback identical to the first example above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
2025-05-07T20:33:05.2994878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2994980Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2995362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2995599Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2995966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2996067Z kernel = self.compile( 2025-05-07T20:33:05.2996515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2996747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2996884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2996889Z 2025-05-07T20:33:05.2997112Z self = 2025-05-07T20:33:05.2997916Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2998440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802d85e0>} 2025-05-07T20:33:05.2999216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2999496Z context = 2025-05-07T20:33:05.2999501Z 2025-05-07T20:33:05.2999682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2999957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3000187Z module_map=module_map) 2025-05-07T20:33:05.3000365Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3000468Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3000557Z E ^ 2025-05-07T20:33:05.3000926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3000931Z 2025-05-07T20:33:05.3001363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3001371Z 2025-05-07T20:33:05.3001486Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3001722Z self=, 2025-05-07T20:33:05.3001802Z T=2048, 2025-05-07T20:33:05.3001891Z D=5120, 2025-05-07T20:33:05.3001976Z scale_ub=None, 2025-05-07T20:33:05.3002076Z contiguous=False, 2025-05-07T20:33:05.3002161Z compiled=True, 2025-05-07T20:33:05.3002237Z ) 2025-05-07T20:33:05.3002470Z self = 2025-05-07T20:33:05.3002650Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3002655Z 2025-05-07T20:33:05.3002735Z @given( 2025-05-07T20:33:05.3002865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3002969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3003092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3003219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3003343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3003426Z ) 2025-05-07T20:33:05.3003684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3003783Z def test_silu_mul_quant( 2025-05-07T20:33:05.3003868Z self, 2025-05-07T20:33:05.3003949Z T: int, 2025-05-07T20:33:05.3004029Z D: int, 2025-05-07T20:33:05.3004139Z scale_ub: Optional[float], 2025-05-07T20:33:05.3004234Z contiguous: bool, 2025-05-07T20:33:05.3004322Z compiled: bool, 2025-05-07T20:33:05.3004410Z ) -> None: 2025-05-07T20:33:05.3004508Z torch.manual_seed(2025) 2025-05-07T20:33:05.3004586Z 2025-05-07T20:33:05.3004768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3004845Z 2025-05-07T20:33:05.3005000Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3005128Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3005223Z x = x_sign * x_clamp 2025-05-07T20:33:05.3005349Z x0 = x[:, :D] 2025-05-07T20:33:05.3005434Z x1 = x[:, D:] 2025-05-07T20:33:05.3005513Z 2025-05-07T20:33:05.3005605Z if contiguous: 2025-05-07T20:33:05.3005698Z x0 = x0.contiguous() 2025-05-07T20:33:05.3005791Z x1 = x1.contiguous() 2025-05-07T20:33:05.3005872Z 2025-05-07T20:33:05.3005965Z if scale_ub is not None: 2025-05-07T20:33:05.3006073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3006221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3006298Z ) 2025-05-07T20:33:05.3006385Z else: 2025-05-07T20:33:05.3006482Z scale_ub_tensor = None 2025-05-07T20:33:05.3006558Z 2025-05-07T20:33:05.3006700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3006797Z op = silu_mul_quant 2025-05-07T20:33:05.3006884Z if compiled: 2025-05-07T20:33:05.3007040Z op = torch.compile(op) 2025-05-07T20:33:05.3007153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3007264Z 2025-05-07T20:33:05.3007366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3007370Z 2025-05-07T20:33:05.3007472Z moe/activation_test.py:117: 2025-05-07T20:33:05.3007604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3007717Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3007821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3008212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3008310Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3008825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3008938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3009322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3009559Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3009925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3010025Z kernel = self.compile( 2025-05-07T20:33:05.3010433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3010617Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3010752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3010756Z 2025-05-07T20:33:05.3010977Z self = 2025-05-07T20:33:05.3011786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3012322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802d9440>} 2025-05-07T20:33:05.3013094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3013301Z context = 2025-05-07T20:33:05.3013306Z 2025-05-07T20:33:05.3013881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3014190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3014482Z module_map=module_map) 2025-05-07T20:33:05.3014755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3014862Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3014949Z E ^ 2025-05-07T20:33:05.3015319Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3015324Z 2025-05-07T20:33:05.3015762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3015767Z 2025-05-07T20:33:05.3015877Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3016109Z self=, 2025-05-07T20:33:05.3016196Z T=2048, 2025-05-07T20:33:05.3016279Z D=5120, 2025-05-07T20:33:05.3016374Z scale_ub=1200.0, 2025-05-07T20:33:05.3016467Z contiguous=False, 2025-05-07T20:33:05.3016551Z compiled=True, 2025-05-07T20:33:05.3016631Z ) 2025-05-07T20:33:05.3016926Z self = 2025-05-07T20:33:05.3017170Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3017175Z 2025-05-07T20:33:05.3017264Z @given( 2025-05-07T20:33:05.3017388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3017497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3017614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3017735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3017862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3017939Z ) 2025-05-07T20:33:05.3018193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3018296Z def test_silu_mul_quant( 2025-05-07T20:33:05.3018376Z self, 2025-05-07T20:33:05.3018457Z T: int, 2025-05-07T20:33:05.3018546Z D: int, 2025-05-07T20:33:05.3018652Z scale_ub: Optional[float], 2025-05-07T20:33:05.3018744Z contiguous: bool, 2025-05-07T20:33:05.3018843Z compiled: bool, 2025-05-07T20:33:05.3018926Z ) -> None: 2025-05-07T20:33:05.3019032Z torch.manual_seed(2025) 2025-05-07T20:33:05.3019108Z 2025-05-07T20:33:05.3019284Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3019366Z 2025-05-07T20:33:05.3019461Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3019590Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3019686Z x = x_sign * x_clamp 2025-05-07T20:33:05.3019770Z x0 = x[:, :D] 2025-05-07T20:33:05.3019851Z x1 = x[:, D:] 2025-05-07T20:33:05.3019930Z 2025-05-07T20:33:05.3020015Z if contiguous: 2025-05-07T20:33:05.3020108Z x0 = x0.contiguous() 2025-05-07T20:33:05.3020206Z x1 = x1.contiguous() 2025-05-07T20:33:05.3020279Z 2025-05-07T20:33:05.3020370Z if scale_ub is not None: 2025-05-07T20:33:05.3020487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3020628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3020714Z ) 2025-05-07T20:33:05.3020790Z else: 2025-05-07T20:33:05.3020885Z scale_ub_tensor = None 2025-05-07T20:33:05.3020964Z 2025-05-07T20:33:05.3021097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3021190Z op = silu_mul_quant 2025-05-07T20:33:05.3021281Z if compiled: 2025-05-07T20:33:05.3021381Z op = torch.compile(op) 2025-05-07T20:33:05.3021487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3021568Z 2025-05-07T20:33:05.3021658Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3021663Z 2025-05-07T20:33:05.3021767Z moe/activation_test.py:117: 2025-05-07T20:33:05.3021946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3022049Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3022157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3022575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3022671Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3023186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3023286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3023661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3023893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3024246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3024353Z kernel = self.compile( 2025-05-07T20:33:05.3024744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3025024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3025157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3025162Z 2025-05-07T20:33:05.3025381Z self = 2025-05-07T20:33:05.3026182Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3026705Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802da660>} 2025-05-07T20:33:05.3027485Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3027685Z context = 2025-05-07T20:33:05.3027690Z 2025-05-07T20:33:05.3027865Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3028136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3028251Z module_map=module_map) 2025-05-07T20:33:05.3028416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3028520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3028605Z E ^ 2025-05-07T20:33:05.3028969Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3028977Z 2025-05-07T20:33:05.3029402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3029409Z 2025-05-07T20:33:05.3029528Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3029758Z self=, 2025-05-07T20:33:05.3029842Z T=4096, 2025-05-07T20:33:05.3029921Z D=5120, 2025-05-07T20:33:05.3030010Z scale_ub=1200.0, 2025-05-07T20:33:05.3030104Z contiguous=True, 2025-05-07T20:33:05.3030189Z compiled=True, 2025-05-07T20:33:05.3030262Z ) 2025-05-07T20:33:05.3030494Z self = 2025-05-07T20:33:05.3030671Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3030676Z 2025-05-07T20:33:05.3030755Z @given( 2025-05-07T20:33:05.3030883Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3031029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3031150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3031309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3031429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3031509Z ) 2025-05-07T20:33:05.3031763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3031858Z def test_silu_mul_quant( 2025-05-07T20:33:05.3031941Z self, 2025-05-07T20:33:05.3032019Z T: int, 2025-05-07T20:33:05.3032095Z D: int, 2025-05-07T20:33:05.3032200Z scale_ub: Optional[float], 2025-05-07T20:33:05.3032291Z contiguous: bool, 2025-05-07T20:33:05.3032379Z compiled: bool, 2025-05-07T20:33:05.3032463Z ) -> None: 2025-05-07T20:33:05.3032559Z torch.manual_seed(2025) 2025-05-07T20:33:05.3032638Z 2025-05-07T20:33:05.3032814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3032893Z 2025-05-07T20:33:05.3032993Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3033164Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3033258Z x = x_sign * x_clamp 2025-05-07T20:33:05.3033385Z x0 = x[:, :D] 2025-05-07T20:33:05.3033468Z x1 = x[:, D:] 2025-05-07T20:33:05.3033541Z 2025-05-07T20:33:05.3033632Z if contiguous: 2025-05-07T20:33:05.3033723Z x0 = x0.contiguous() 2025-05-07T20:33:05.3033813Z x1 = x1.contiguous() 2025-05-07T20:33:05.3033891Z 2025-05-07T20:33:05.3033983Z if scale_ub is not None: 2025-05-07T20:33:05.3034095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3034234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3034310Z ) 2025-05-07T20:33:05.3034394Z else: 2025-05-07T20:33:05.3034492Z scale_ub_tensor = None 2025-05-07T20:33:05.3034566Z 2025-05-07T20:33:05.3034711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3034806Z op = silu_mul_quant 2025-05-07T20:33:05.3034895Z if compiled: 2025-05-07T20:33:05.3035012Z op = torch.compile(op) 2025-05-07T20:33:05.3035125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3035198Z 2025-05-07T20:33:05.3035295Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3035300Z 2025-05-07T20:33:05.3035400Z moe/activation_test.py:117: 2025-05-07T20:33:05.3035535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3035639Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3035740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3036123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3036219Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3036730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3036841Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3037215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3037453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3037804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3037902Z kernel = self.compile( 2025-05-07T20:33:05.3038305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3038487Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3038617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3038630Z 2025-05-07T20:33:05.3038889Z self = 2025-05-07T20:33:05.3039729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3040315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802db9c0>} 2025-05-07T20:33:05.3041087Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3041292Z context = 2025-05-07T20:33:05.3041296Z 2025-05-07T20:33:05.3041464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3041738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3041893Z module_map=module_map) 2025-05-07T20:33:05.3042099Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3042213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3042292Z E ^ 2025-05-07T20:33:05.3042660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3042665Z 2025-05-07T20:33:05.3043102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3043106Z 2025-05-07T20:33:05.3043214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3043445Z self=, 2025-05-07T20:33:05.3043536Z T=128, 2025-05-07T20:33:05.3043614Z D=5120, 2025-05-07T20:33:05.3043709Z scale_ub=1200.0, 2025-05-07T20:33:05.3043798Z contiguous=False, 2025-05-07T20:33:05.3043886Z compiled=True, 2025-05-07T20:33:05.3043968Z ) 2025-05-07T20:33:05.3044222Z self = 2025-05-07T20:33:05.3044423Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3044427Z 2025-05-07T20:33:05.3044514Z @given( 2025-05-07T20:33:05.3044639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3044740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3044864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3044983Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3045106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3045181Z ) 2025-05-07T20:33:05.3045434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3045537Z def test_silu_mul_quant( 2025-05-07T20:33:05.3045615Z self, 2025-05-07T20:33:05.3045692Z T: int, 2025-05-07T20:33:05.3045785Z D: int, 2025-05-07T20:33:05.3045887Z scale_ub: Optional[float], 2025-05-07T20:33:05.3045981Z contiguous: bool, 2025-05-07T20:33:05.3046075Z compiled: bool, 2025-05-07T20:33:05.3046154Z ) -> None: 2025-05-07T20:33:05.3046250Z torch.manual_seed(2025) 2025-05-07T20:33:05.3046337Z 2025-05-07T20:33:05.3046511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3046594Z 2025-05-07T20:33:05.3046689Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3046817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3046913Z x = x_sign * x_clamp 2025-05-07T20:33:05.3046994Z x0 = x[:, :D] 2025-05-07T20:33:05.3047075Z x1 = x[:, D:] 2025-05-07T20:33:05.3047156Z 2025-05-07T20:33:05.3047242Z if contiguous: 2025-05-07T20:33:05.3047405Z x0 = x0.contiguous() 2025-05-07T20:33:05.3047505Z x1 = x1.contiguous() 2025-05-07T20:33:05.3047580Z 2025-05-07T20:33:05.3047672Z if scale_ub is not None: 2025-05-07T20:33:05.3047821Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3047966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3048049Z ) 2025-05-07T20:33:05.3048127Z else: 2025-05-07T20:33:05.3048222Z scale_ub_tensor = None 2025-05-07T20:33:05.3048300Z 2025-05-07T20:33:05.3048432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3048525Z op = silu_mul_quant 2025-05-07T20:33:05.3048616Z if compiled: 2025-05-07T20:33:05.3048716Z op = torch.compile(op) 2025-05-07T20:33:05.3048823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3048901Z 2025-05-07T20:33:05.3048992Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3048997Z 2025-05-07T20:33:05.3049099Z moe/activation_test.py:117: 2025-05-07T20:33:05.3049234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3049379Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3049523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3049901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3049996Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3050511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3050610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3050979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3051217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3051570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3051677Z kernel = self.compile( 2025-05-07T20:33:05.3052079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3052259Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3052394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3052399Z 2025-05-07T20:33:05.3052609Z self = 2025-05-07T20:33:05.3053417Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3053947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfcd0fe0>} 2025-05-07T20:33:05.3054775Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3054983Z context = 2025-05-07T20:33:05.3054987Z 2025-05-07T20:33:05.3055157Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3055438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3055548Z module_map=module_map) 2025-05-07T20:33:05.3055715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3055821Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3055900Z E ^ 2025-05-07T20:33:05.3056271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3056341Z 2025-05-07T20:33:05.3056811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3056818Z 2025-05-07T20:33:05.3056926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3057164Z self=, 2025-05-07T20:33:05.3057242Z T=16384, 2025-05-07T20:33:05.3057321Z D=7168, 2025-05-07T20:33:05.3057413Z scale_ub=1200.0, 2025-05-07T20:33:05.3057499Z contiguous=True, 2025-05-07T20:33:05.3057591Z compiled=True, 2025-05-07T20:33:05.3057667Z ) 2025-05-07T20:33:05.3057891Z self = 2025-05-07T20:33:05.3058079Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3058084Z 2025-05-07T20:33:05.3058161Z @given( 2025-05-07T20:33:05.3058284Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3058392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3058551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3058709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3058835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3058911Z ) 2025-05-07T20:33:05.3059173Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3059267Z def test_silu_mul_quant( 2025-05-07T20:33:05.3059345Z self, 2025-05-07T20:33:05.3059430Z T: int, 2025-05-07T20:33:05.3059507Z D: int, 2025-05-07T20:33:05.3059607Z scale_ub: Optional[float], 2025-05-07T20:33:05.3059706Z contiguous: bool, 2025-05-07T20:33:05.3059794Z compiled: bool, 2025-05-07T20:33:05.3059871Z ) -> None: 2025-05-07T20:33:05.3059975Z torch.manual_seed(2025) 2025-05-07T20:33:05.3060051Z 2025-05-07T20:33:05.3060223Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3060307Z 2025-05-07T20:33:05.3060402Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3060539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3060631Z x = x_sign * x_clamp 2025-05-07T20:33:05.3060712Z x0 = x[:, :D] 2025-05-07T20:33:05.3060805Z x1 = x[:, D:] 2025-05-07T20:33:05.3060880Z 2025-05-07T20:33:05.3060964Z if contiguous: 2025-05-07T20:33:05.3061061Z x0 = x0.contiguous() 2025-05-07T20:33:05.3061152Z x1 = x1.contiguous() 2025-05-07T20:33:05.3061226Z 2025-05-07T20:33:05.3061325Z if scale_ub is not None: 2025-05-07T20:33:05.3061434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3061571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3061655Z ) 2025-05-07T20:33:05.3061734Z else: 2025-05-07T20:33:05.3061838Z scale_ub_tensor = None 2025-05-07T20:33:05.3061911Z 2025-05-07T20:33:05.3062042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3062142Z op = silu_mul_quant 2025-05-07T20:33:05.3062230Z if compiled: 2025-05-07T20:33:05.3062334Z op = torch.compile(op) 2025-05-07T20:33:05.3062446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3062518Z 2025-05-07T20:33:05.3062610Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3062615Z 2025-05-07T20:33:05.3062719Z moe/activation_test.py:117: 2025-05-07T20:33:05.3062848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3062951Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3063058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3063436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3063537Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3064096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3064236Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3064617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3064848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3065207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3065304Z kernel = self.compile( 2025-05-07T20:33:05.3065700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3065888Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3066018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3066025Z 2025-05-07T20:33:05.3066236Z self = 2025-05-07T20:33:05.3067130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3067655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfcd1e40>} 2025-05-07T20:33:05.3068432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3068631Z context = 2025-05-07T20:33:05.3068636Z 2025-05-07T20:33:05.3068812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3069085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3069204Z module_map=module_map) 2025-05-07T20:33:05.3069374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3069475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3069555Z E ^ 2025-05-07T20:33:05.3069925Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3069930Z 2025-05-07T20:33:05.3070356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3070361Z 2025-05-07T20:33:05.3070472Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3070701Z self=, 2025-05-07T20:33:05.3070783Z T=16384, 2025-05-07T20:33:05.3070869Z D=5120, 2025-05-07T20:33:05.3070953Z scale_ub=1200.0, 2025-05-07T20:33:05.3071043Z contiguous=True, 2025-05-07T20:33:05.3071138Z compiled=False, 2025-05-07T20:33:05.3071215Z ) 2025-05-07T20:33:05.3071447Z self = 2025-05-07T20:33:05.3071630Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3071634Z 2025-05-07T20:33:05.3071712Z @given( 2025-05-07T20:33:05.3071841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3071942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3072057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3072182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3072296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3072370Z ) 2025-05-07T20:33:05.3072632Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3072774Z def test_silu_mul_quant( 2025-05-07T20:33:05.3072862Z self, 2025-05-07T20:33:05.3072941Z T: int, 2025-05-07T20:33:05.3073056Z D: int, 2025-05-07T20:33:05.3073166Z scale_ub: Optional[float], 2025-05-07T20:33:05.3073259Z contiguous: bool, 2025-05-07T20:33:05.3073347Z compiled: bool, 2025-05-07T20:33:05.3073431Z ) -> None: 2025-05-07T20:33:05.3073528Z torch.manual_seed(2025) 2025-05-07T20:33:05.3073602Z 2025-05-07T20:33:05.3073783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3073858Z 2025-05-07T20:33:05.3073953Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3074084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3074176Z x = x_sign * x_clamp 2025-05-07T20:33:05.3074267Z x0 = x[:, :D] 2025-05-07T20:33:05.3074348Z x1 = x[:, D:] 2025-05-07T20:33:05.3074424Z 2025-05-07T20:33:05.3074515Z if contiguous: 2025-05-07T20:33:05.3074609Z x0 = x0.contiguous() 2025-05-07T20:33:05.3074766Z x1 = x1.contiguous() 2025-05-07T20:33:05.3074846Z 2025-05-07T20:33:05.3074943Z if scale_ub is not None: 2025-05-07T20:33:05.3075110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3075259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3075335Z ) 2025-05-07T20:33:05.3075412Z else: 2025-05-07T20:33:05.3075514Z scale_ub_tensor = None 2025-05-07T20:33:05.3075589Z 2025-05-07T20:33:05.3075722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3075821Z op = silu_mul_quant 2025-05-07T20:33:05.3075905Z if compiled: 2025-05-07T20:33:05.3076013Z op = torch.compile(op) 2025-05-07T20:33:05.3076119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3076193Z 2025-05-07T20:33:05.3076292Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3076297Z 2025-05-07T20:33:05.3076394Z moe/activation_test.py:117: 2025-05-07T20:33:05.3076530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3076642Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3076743Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3077263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.3077362Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3077730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3077965Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3078316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3078411Z kernel = self.compile( 2025-05-07T20:33:05.3078816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3079003Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3079141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3079146Z 2025-05-07T20:33:05.3079357Z self = 2025-05-07T20:33:05.3080207Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3080736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfcd2ca0>} 2025-05-07T20:33:05.3081506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3081798Z context = 2025-05-07T20:33:05.3081803Z 2025-05-07T20:33:05.3081973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3082251Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3082359Z module_map=module_map) 2025-05-07T20:33:05.3082523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3082628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3082704Z E ^ 2025-05-07T20:33:05.3083071Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3083076Z 2025-05-07T20:33:05.3083514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3083559Z 2025-05-07T20:33:05.3083668Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3083941Z self=, 2025-05-07T20:33:05.3084020Z T=1, 2025-05-07T20:33:05.3084097Z D=7168, 2025-05-07T20:33:05.3084187Z scale_ub=1200.0, 2025-05-07T20:33:05.3084276Z contiguous=False, 2025-05-07T20:33:05.3084361Z compiled=False, 2025-05-07T20:33:05.3084439Z ) 2025-05-07T20:33:05.3084662Z self = 2025-05-07T20:33:05.3084834Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.3084839Z 2025-05-07T20:33:05.3084925Z @given( 2025-05-07T20:33:05.3085045Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3085151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3085271Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3085390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3085518Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3085595Z ) 2025-05-07T20:33:05.3085847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3085951Z def test_silu_mul_quant( 2025-05-07T20:33:05.3086027Z self, 2025-05-07T20:33:05.3086104Z T: int, 2025-05-07T20:33:05.3086186Z D: int, 2025-05-07T20:33:05.3086287Z scale_ub: Optional[float], 2025-05-07T20:33:05.3086385Z contiguous: bool, 2025-05-07T20:33:05.3086472Z compiled: bool, 2025-05-07T20:33:05.3086552Z ) -> None: 2025-05-07T20:33:05.3086654Z torch.manual_seed(2025) 2025-05-07T20:33:05.3086726Z 2025-05-07T20:33:05.3086902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3086985Z 2025-05-07T20:33:05.3087078Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3091925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3092055Z x = x_sign * x_clamp 2025-05-07T20:33:05.3092151Z x0 = x[:, :D] 2025-05-07T20:33:05.3092238Z x1 = x[:, D:] 2025-05-07T20:33:05.3092314Z 2025-05-07T20:33:05.3092410Z if contiguous: 2025-05-07T20:33:05.3092505Z x0 = x0.contiguous() 2025-05-07T20:33:05.3092600Z x1 = x1.contiguous() 2025-05-07T20:33:05.3092683Z 2025-05-07T20:33:05.3092778Z if scale_ub is not None: 2025-05-07T20:33:05.3092892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3093044Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3093123Z ) 2025-05-07T20:33:05.3093204Z else: 2025-05-07T20:33:05.3093311Z scale_ub_tensor = None 2025-05-07T20:33:05.3093387Z 2025-05-07T20:33:05.3093532Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3093706Z op = silu_mul_quant 2025-05-07T20:33:05.3093795Z if compiled: 2025-05-07T20:33:05.3093910Z op = torch.compile(op) 2025-05-07T20:33:05.3094063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3094139Z 2025-05-07T20:33:05.3094243Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3094248Z 2025-05-07T20:33:05.3094350Z moe/activation_test.py:117: 2025-05-07T20:33:05.3094486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3094598Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3094703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3095237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3095339Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3095713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3095957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3096401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3096502Z kernel = self.compile( 2025-05-07T20:33:05.3096908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3097091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3097229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3097234Z 2025-05-07T20:33:05.3097446Z self = 2025-05-07T20:33:05.3098254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3098802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0c0e0>} 2025-05-07T20:33:05.3099575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3099783Z context = 2025-05-07T20:33:05.3099788Z 2025-05-07T20:33:05.3099959Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3100241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3100353Z module_map=module_map) 2025-05-07T20:33:05.3100521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3100634Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3100720Z E ^ 2025-05-07T20:33:05.3101093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3101098Z 2025-05-07T20:33:05.3101533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3101538Z 2025-05-07T20:33:05.3101644Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3101883Z self=, 2025-05-07T20:33:05.3101965Z T=4096, 2025-05-07T20:33:05.3102043Z D=7168, 2025-05-07T20:33:05.3102140Z scale_ub=1200.0, 2025-05-07T20:33:05.3102231Z contiguous=False, 2025-05-07T20:33:05.3102317Z compiled=True, 2025-05-07T20:33:05.3102400Z ) 2025-05-07T20:33:05.3102625Z self = 2025-05-07T20:33:05.3102853Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3102868Z 2025-05-07T20:33:05.3102950Z @given( 2025-05-07T20:33:05.3103116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3103226Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3103345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3103468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3103595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3103674Z ) 2025-05-07T20:33:05.3103929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3104034Z def test_silu_mul_quant( 2025-05-07T20:33:05.3104112Z self, 2025-05-07T20:33:05.3104191Z T: int, 2025-05-07T20:33:05.3104277Z D: int, 2025-05-07T20:33:05.3104379Z scale_ub: Optional[float], 2025-05-07T20:33:05.3104482Z contiguous: bool, 2025-05-07T20:33:05.3104573Z compiled: bool, 2025-05-07T20:33:05.3104652Z ) -> None: 2025-05-07T20:33:05.3104800Z torch.manual_seed(2025) 2025-05-07T20:33:05.3104876Z 2025-05-07T20:33:05.3105092Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3105177Z 2025-05-07T20:33:05.3105272Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3105401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3105500Z x = x_sign * x_clamp 2025-05-07T20:33:05.3105582Z x0 = x[:, :D] 2025-05-07T20:33:05.3105664Z x1 = x[:, D:] 2025-05-07T20:33:05.3105747Z 2025-05-07T20:33:05.3105833Z if contiguous: 2025-05-07T20:33:05.3105928Z x0 = x0.contiguous() 2025-05-07T20:33:05.3106027Z x1 = x1.contiguous() 2025-05-07T20:33:05.3106101Z 2025-05-07T20:33:05.3106201Z if scale_ub is not None: 2025-05-07T20:33:05.3106311Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3106452Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3106540Z ) 2025-05-07T20:33:05.3106619Z else: 2025-05-07T20:33:05.3106720Z scale_ub_tensor = None 2025-05-07T20:33:05.3106804Z 2025-05-07T20:33:05.3106939Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3107032Z op = silu_mul_quant 2025-05-07T20:33:05.3107128Z if compiled: 2025-05-07T20:33:05.3107231Z op = torch.compile(op) 2025-05-07T20:33:05.3107340Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3107424Z 2025-05-07T20:33:05.3107521Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3107526Z 2025-05-07T20:33:05.3107634Z moe/activation_test.py:117: 2025-05-07T20:33:05.3107770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3107876Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3107986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3108369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3108472Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3108998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3109100Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3109480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3109713Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3110067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3110173Z kernel = self.compile( 2025-05-07T20:33:05.3110569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3110802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3110984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3110989Z 2025-05-07T20:33:05.3111203Z self = 2025-05-07T20:33:05.3112023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3112550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0d300>} 2025-05-07T20:33:05.3113962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3114285Z context = 2025-05-07T20:33:05.3114451Z 2025-05-07T20:33:05.3114703Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3114991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3115110Z module_map=module_map) 2025-05-07T20:33:05.3115283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3115385Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3115465Z E ^ 2025-05-07T20:33:05.3115840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3115845Z 2025-05-07T20:33:05.3116276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3116283Z 2025-05-07T20:33:05.3116399Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3116640Z self=, 2025-05-07T20:33:05.3116724Z T=128, 2025-05-07T20:33:05.3116815Z D=7168, 2025-05-07T20:33:05.3116903Z scale_ub=1200.0, 2025-05-07T20:33:05.3116991Z contiguous=False, 2025-05-07T20:33:05.3117090Z compiled=True, 2025-05-07T20:33:05.3117167Z ) 2025-05-07T20:33:05.3117392Z self = 2025-05-07T20:33:05.3117577Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3117582Z 2025-05-07T20:33:05.3117661Z @given( 2025-05-07T20:33:05.3117782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3117897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3118017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3118155Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3118274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3118353Z ) 2025-05-07T20:33:05.3118618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3118716Z def test_silu_mul_quant( 2025-05-07T20:33:05.3118800Z self, 2025-05-07T20:33:05.3118885Z T: int, 2025-05-07T20:33:05.3118964Z D: int, 2025-05-07T20:33:05.3119064Z scale_ub: Optional[float], 2025-05-07T20:33:05.3119163Z contiguous: bool, 2025-05-07T20:33:05.3119251Z compiled: bool, 2025-05-07T20:33:05.3119342Z ) -> None: 2025-05-07T20:33:05.3119440Z torch.manual_seed(2025) 2025-05-07T20:33:05.3119516Z 2025-05-07T20:33:05.3119703Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3119779Z 2025-05-07T20:33:05.3119871Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3120006Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3120280Z x = x_sign * x_clamp 2025-05-07T20:33:05.3120364Z x0 = x[:, :D] 2025-05-07T20:33:05.3120454Z x1 = x[:, D:] 2025-05-07T20:33:05.3120531Z 2025-05-07T20:33:05.3120694Z if contiguous: 2025-05-07T20:33:05.3120799Z x0 = x0.contiguous() 2025-05-07T20:33:05.3120889Z x1 = x1.contiguous() 2025-05-07T20:33:05.3120963Z 2025-05-07T20:33:05.3121067Z if scale_ub is not None: 2025-05-07T20:33:05.3121175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3121322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3121399Z ) 2025-05-07T20:33:05.3121479Z else: 2025-05-07T20:33:05.3121582Z scale_ub_tensor = None 2025-05-07T20:33:05.3121656Z 2025-05-07T20:33:05.3121788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3121886Z op = silu_mul_quant 2025-05-07T20:33:05.3121973Z if compiled: 2025-05-07T20:33:05.3122079Z op = torch.compile(op) 2025-05-07T20:33:05.3122194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3122312Z 2025-05-07T20:33:05.3122407Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3122456Z 2025-05-07T20:33:05.3122558Z moe/activation_test.py:117: 2025-05-07T20:33:05.3122688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3122798Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3122900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3123283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3123386Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3123900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3124001Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3124380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3124620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3124982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3125079Z kernel = self.compile( 2025-05-07T20:33:05.3125473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3125662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3125793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3125797Z 2025-05-07T20:33:05.3126015Z self = 2025-05-07T20:33:05.3126815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3127352Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0e160>} 2025-05-07T20:33:05.3128126Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3128324Z context = 2025-05-07T20:33:05.3128328Z 2025-05-07T20:33:05.3128505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3128779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3128888Z module_map=module_map) 2025-05-07T20:33:05.3129108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3129212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3129298Z E ^ 2025-05-07T20:33:05.3129704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3129709Z 2025-05-07T20:33:05.3130140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3130145Z 2025-05-07T20:33:05.3130257Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3130486Z self=, 2025-05-07T20:33:05.3130572Z T=2048, 2025-05-07T20:33:05.3130653Z D=7168, 2025-05-07T20:33:05.3130739Z scale_ub=None, 2025-05-07T20:33:05.3130834Z contiguous=True, 2025-05-07T20:33:05.3130921Z compiled=True, 2025-05-07T20:33:05.3130997Z ) 2025-05-07T20:33:05.3131229Z self = 2025-05-07T20:33:05.3131449Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.3131457Z 2025-05-07T20:33:05.3131535Z @given( 2025-05-07T20:33:05.3131699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3131803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3131929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3132052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3132167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3132251Z ) 2025-05-07T20:33:05.3132504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3132600Z def test_silu_mul_quant( 2025-05-07T20:33:05.3132684Z self, 2025-05-07T20:33:05.3132763Z T: int, 2025-05-07T20:33:05.3132840Z D: int, 2025-05-07T20:33:05.3132951Z scale_ub: Optional[float], 2025-05-07T20:33:05.3133042Z contiguous: bool, 2025-05-07T20:33:05.3133132Z compiled: bool, 2025-05-07T20:33:05.3133218Z ) -> None: 2025-05-07T20:33:05.3133318Z torch.manual_seed(2025) 2025-05-07T20:33:05.3133395Z 2025-05-07T20:33:05.3133576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3133653Z 2025-05-07T20:33:05.3133753Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3133880Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3133970Z x = x_sign * x_clamp 2025-05-07T20:33:05.3134056Z x0 = x[:, :D] 2025-05-07T20:33:05.3134137Z x1 = x[:, D:] 2025-05-07T20:33:05.3134212Z 2025-05-07T20:33:05.3134304Z if contiguous: 2025-05-07T20:33:05.3134396Z x0 = x0.contiguous() 2025-05-07T20:33:05.3134488Z x1 = x1.contiguous() 2025-05-07T20:33:05.3134567Z 2025-05-07T20:33:05.3134660Z if scale_ub is not None: 2025-05-07T20:33:05.3134770Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3134915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3134996Z ) 2025-05-07T20:33:05.3135081Z else: 2025-05-07T20:33:05.3135178Z scale_ub_tensor = None 2025-05-07T20:33:05.3135252Z 2025-05-07T20:33:05.3135394Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3135485Z op = silu_mul_quant 2025-05-07T20:33:05.3135570Z if compiled: 2025-05-07T20:33:05.3135678Z op = torch.compile(op) 2025-05-07T20:33:05.3135786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3135859Z 2025-05-07T20:33:05.3135957Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3135961Z 2025-05-07T20:33:05.3136058Z moe/activation_test.py:117: 2025-05-07T20:33:05.3136193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3136343Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3136444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3136870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3136970Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3137481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3137586Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3137954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3138194Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3138545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3138640Z kernel = self.compile( 2025-05-07T20:33:05.3139039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3139265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3139436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3139446Z 2025-05-07T20:33:05.3139655Z self = 2025-05-07T20:33:05.3140456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3140984Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0f420>} 2025-05-07T20:33:05.3141751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3141959Z context = 2025-05-07T20:33:05.3141964Z 2025-05-07T20:33:05.3142133Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3142407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3142521Z module_map=module_map) 2025-05-07T20:33:05.3142688Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3142788Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3142870Z E ^ 2025-05-07T20:33:05.3143236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3143240Z 2025-05-07T20:33:05.3143673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3143680Z 2025-05-07T20:33:05.3143789Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3144020Z self=, 2025-05-07T20:33:05.3144102Z T=16384, 2025-05-07T20:33:05.3144179Z D=5120, 2025-05-07T20:33:05.3144267Z scale_ub=None, 2025-05-07T20:33:05.3144354Z contiguous=False, 2025-05-07T20:33:05.3144441Z compiled=False, 2025-05-07T20:33:05.3144519Z ) 2025-05-07T20:33:05.3144742Z self = 2025-05-07T20:33:05.3144923Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3144928Z 2025-05-07T20:33:05.3145009Z @given( 2025-05-07T20:33:05.3145129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3145228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3145346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3145537Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3145657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3145731Z ) 2025-05-07T20:33:05.3146026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3146126Z def test_silu_mul_quant( 2025-05-07T20:33:05.3146203Z self, 2025-05-07T20:33:05.3146279Z T: int, 2025-05-07T20:33:05.3146362Z D: int, 2025-05-07T20:33:05.3146459Z scale_ub: Optional[float], 2025-05-07T20:33:05.3146554Z contiguous: bool, 2025-05-07T20:33:05.3146639Z compiled: bool, 2025-05-07T20:33:05.3146717Z ) -> None: 2025-05-07T20:33:05.3146816Z torch.manual_seed(2025) 2025-05-07T20:33:05.3146888Z 2025-05-07T20:33:05.3147061Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3147139Z 2025-05-07T20:33:05.3147232Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3147361Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3149319Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3149326Z 2025-05-07T20:33:05.3149448Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3149453Z 2025-05-07T20:33:05.3149560Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3149787Z self=, 2025-05-07T20:33:05.3149868Z T=4096, 2025-05-07T20:33:05.3149945Z D=7168, 2025-05-07T20:33:05.3150027Z scale_ub=1200.0, 2025-05-07T20:33:05.3150118Z contiguous=True, 2025-05-07T20:33:05.3150201Z compiled=True, 2025-05-07T20:33:05.3150279Z ) 2025-05-07T20:33:05.3150506Z self = 2025-05-07T20:33:05.3150681Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3150686Z 2025-05-07T20:33:05.3150762Z @given( 2025-05-07T20:33:05.3150884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3150982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3151101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3151219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3151332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3151410Z ) 2025-05-07T20:33:05.3151661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3151758Z def test_silu_mul_quant( 2025-05-07T20:33:05.3151841Z self, 2025-05-07T20:33:05.3151917Z T: int, 2025-05-07T20:33:05.3151996Z D: int, 2025-05-07T20:33:05.3152102Z scale_ub: Optional[float], 2025-05-07T20:33:05.3152192Z contiguous: bool, 2025-05-07T20:33:05.3152278Z compiled: bool, 2025-05-07T20:33:05.3152361Z ) -> None: 2025-05-07T20:33:05.3152455Z torch.manual_seed(2025) 2025-05-07T20:33:05.3152526Z 2025-05-07T20:33:05.3152700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3152775Z 2025-05-07T20:33:05.3152875Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3152998Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3154887Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
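Note on the interleaved OutOfMemoryError failures: free memory shrinks across successive Hypothesis examples (140.44 MiB free above, 28.44 MiB here) because each failed example leaves its bfloat16 inputs in the caching allocator, so later [T, 2*D] allocations (for example T=16384, D=7168 needs a 448 MiB bf16 tensor) no longer fit in the A10G's reported 22.07 GiB. Beyond the allocator hint printed in the message itself (set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before CUDA initializes), a small cleanup between examples is a common mitigation; a sketch, where the per-example hook placement is an assumption:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # drop dead Python references first, then return the allocator's
        # cached blocks to the driver so the next example starts clean
        gc.collect()
        torch.cuda.empty_cache()

    # hypothetical per-example hook, e.g. unittest tearDown or a pytest fixture:
    # def tearDown(self) -> None:
    #     release_cuda_memory()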
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3154938Z 2025-05-07T20:33:05.3155059Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3155063Z 2025-05-07T20:33:05.3155167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3155399Z self=, 2025-05-07T20:33:05.3155477Z T=16384, 2025-05-07T20:33:05.3155555Z D=7168, 2025-05-07T20:33:05.3155643Z scale_ub=None, 2025-05-07T20:33:05.3155730Z contiguous=False, 2025-05-07T20:33:05.3155815Z compiled=False, 2025-05-07T20:33:05.3155893Z ) 2025-05-07T20:33:05.3156116Z self = 2025-05-07T20:33:05.3156312Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3156356Z 2025-05-07T20:33:05.3156439Z @given( 2025-05-07T20:33:05.3156601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3156703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3156817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3156939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3157052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3157127Z ) 2025-05-07T20:33:05.3157381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3157474Z def test_silu_mul_quant( 2025-05-07T20:33:05.3157553Z self, 2025-05-07T20:33:05.3157630Z T: int, 2025-05-07T20:33:05.3157707Z D: int, 2025-05-07T20:33:05.3157807Z scale_ub: Optional[float], 2025-05-07T20:33:05.3157901Z contiguous: bool, 2025-05-07T20:33:05.3157986Z compiled: bool, 2025-05-07T20:33:05.3158071Z ) -> None: 2025-05-07T20:33:05.3158167Z torch.manual_seed(2025) 2025-05-07T20:33:05.3158242Z 2025-05-07T20:33:05.3158421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3160348Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3160354Z 2025-05-07T20:33:05.3160480Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3160484Z 2025-05-07T20:33:05.3160586Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3160822Z self=, 2025-05-07T20:33:05.3160903Z T=2048, 2025-05-07T20:33:05.3160980Z D=7168, 2025-05-07T20:33:05.3161069Z scale_ub=1200.0, 2025-05-07T20:33:05.3161153Z contiguous=True, 2025-05-07T20:33:05.3161236Z compiled=True, 2025-05-07T20:33:05.3161311Z ) 2025-05-07T20:33:05.3161531Z self = 2025-05-07T20:33:05.3161704Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3161714Z 2025-05-07T20:33:05.3161790Z @given( 2025-05-07T20:33:05.3161908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3162014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3162128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3162294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3162415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3162491Z ) 2025-05-07T20:33:05.3162781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3162880Z def test_silu_mul_quant( 2025-05-07T20:33:05.3162959Z self, 2025-05-07T20:33:05.3163035Z T: int, 2025-05-07T20:33:05.3163115Z D: int, 2025-05-07T20:33:05.3163212Z scale_ub: Optional[float], 2025-05-07T20:33:05.3163306Z contiguous: bool, 2025-05-07T20:33:05.3163391Z compiled: bool, 2025-05-07T20:33:05.3163468Z ) -> None: 2025-05-07T20:33:05.3163567Z torch.manual_seed(2025) 2025-05-07T20:33:05.3163641Z 2025-05-07T20:33:05.3163811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3163889Z 2025-05-07T20:33:05.3163981Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3164111Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3166033Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3166074Z 2025-05-07T20:33:05.3166195Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3166200Z 2025-05-07T20:33:05.3166306Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3166533Z self=, 2025-05-07T20:33:05.3166615Z T=2048, 2025-05-07T20:33:05.3166696Z D=7168, 2025-05-07T20:33:05.3166780Z scale_ub=None, 2025-05-07T20:33:05.3166871Z contiguous=True, 2025-05-07T20:33:05.3166958Z compiled=False, 2025-05-07T20:33:05.3167030Z ) 2025-05-07T20:33:05.3167261Z self = 2025-05-07T20:33:05.3167436Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3167440Z 2025-05-07T20:33:05.3167515Z @given( 2025-05-07T20:33:05.3167643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3167743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3167862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3167980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3168093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3168168Z ) 2025-05-07T20:33:05.3168420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3168516Z def test_silu_mul_quant( 2025-05-07T20:33:05.3168596Z self, 2025-05-07T20:33:05.3168674Z T: int, 2025-05-07T20:33:05.3168750Z D: int, 2025-05-07T20:33:05.3168855Z scale_ub: Optional[float], 2025-05-07T20:33:05.3168947Z contiguous: bool, 2025-05-07T20:33:05.3169033Z compiled: bool, 2025-05-07T20:33:05.3169115Z ) -> None: 2025-05-07T20:33:05.3169210Z torch.manual_seed(2025) 2025-05-07T20:33:05.3169287Z 2025-05-07T20:33:05.3169456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3169529Z 2025-05-07T20:33:05.3169628Z > x_sign = torch.sign(x) 2025-05-07T20:33:05.3171508Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3171552Z 2025-05-07T20:33:05.3171676Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:05.3171681Z 2025-05-07T20:33:05.3171784Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3172012Z self=, 2025-05-07T20:33:05.3172094Z T=1, 2025-05-07T20:33:05.3172171Z D=7168, 2025-05-07T20:33:05.3172256Z scale_ub=1200.0, 2025-05-07T20:33:05.3172349Z contiguous=True, 2025-05-07T20:33:05.3172434Z compiled=False, 2025-05-07T20:33:05.3172507Z ) 2025-05-07T20:33:05.3172736Z self = 2025-05-07T20:33:05.3172904Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3172911Z 2025-05-07T20:33:05.3172991Z @given( 2025-05-07T20:33:05.3173151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3173257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3173438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3173556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3173669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3173747Z ) 2025-05-07T20:33:05.3173998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3174094Z def test_silu_mul_quant( 2025-05-07T20:33:05.3174170Z self, 2025-05-07T20:33:05.3174247Z T: int, 2025-05-07T20:33:05.3174325Z D: int, 2025-05-07T20:33:05.3174422Z scale_ub: Optional[float], 2025-05-07T20:33:05.3174512Z contiguous: bool, 2025-05-07T20:33:05.3174602Z compiled: bool, 2025-05-07T20:33:05.3174681Z ) -> None: 2025-05-07T20:33:05.3174775Z torch.manual_seed(2025) 2025-05-07T20:33:05.3174855Z 2025-05-07T20:33:05.3175028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3175102Z 2025-05-07T20:33:05.3175199Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3175324Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3175413Z x = x_sign * x_clamp 2025-05-07T20:33:05.3175498Z x0 = x[:, :D] 2025-05-07T20:33:05.3175579Z x1 = x[:, D:] 2025-05-07T20:33:05.3175655Z 2025-05-07T20:33:05.3175741Z if contiguous: 2025-05-07T20:33:05.3175834Z x0 = x0.contiguous() 2025-05-07T20:33:05.3175928Z x1 = x1.contiguous() 2025-05-07T20:33:05.3176001Z 2025-05-07T20:33:05.3176090Z if scale_ub is not None: 2025-05-07T20:33:05.3176201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3176337Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3176417Z ) 2025-05-07T20:33:05.3176496Z else: 2025-05-07T20:33:05.3176594Z scale_ub_tensor = None 2025-05-07T20:33:05.3176666Z 2025-05-07T20:33:05.3176806Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3176896Z op = silu_mul_quant 2025-05-07T20:33:05.3176984Z if compiled: 2025-05-07T20:33:05.3177084Z op = torch.compile(op) 2025-05-07T20:33:05.3177189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3177267Z 2025-05-07T20:33:05.3177362Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3177366Z 2025-05-07T20:33:05.3177463Z moe/activation_test.py:117: 2025-05-07T20:33:05.3177598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3177701Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3177801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3178320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3178470Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3178891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3179123Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3179475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3179575Z kernel = self.compile( 2025-05-07T20:33:05.3179972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3180158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3180289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3180293Z 2025-05-07T20:33:05.3180507Z self = 2025-05-07T20:33:05.3181357Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3181931Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfbc22a0>} 2025-05-07T20:33:05.3182709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3182907Z context = 2025-05-07T20:33:05.3182911Z 2025-05-07T20:33:05.3183079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3183359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3183470Z module_map=module_map) 2025-05-07T20:33:05.3183640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3183737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3183813Z E ^ 2025-05-07T20:33:05.3184180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3184185Z 2025-05-07T20:33:05.3184609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3184614Z 2025-05-07T20:33:05.3184722Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3184949Z self=, 2025-05-07T20:33:05.3185027Z T=128, 2025-05-07T20:33:05.3185109Z D=5120, 2025-05-07T20:33:05.3185192Z scale_ub=None, 2025-05-07T20:33:05.3185277Z contiguous=True, 2025-05-07T20:33:05.3185363Z compiled=False, 2025-05-07T20:33:05.3185439Z ) 2025-05-07T20:33:05.3185666Z self = 2025-05-07T20:33:05.3185840Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3185845Z 2025-05-07T20:33:05.3185922Z @given( 2025-05-07T20:33:05.3186047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3186147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3186261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3186383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3186497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3186572Z ) 2025-05-07T20:33:05.3186829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3186921Z def test_silu_mul_quant( 2025-05-07T20:33:05.3187044Z self, 2025-05-07T20:33:05.3187122Z T: int, 2025-05-07T20:33:05.3187201Z D: int, 2025-05-07T20:33:05.3187299Z scale_ub: Optional[float], 2025-05-07T20:33:05.3187433Z contiguous: bool, 2025-05-07T20:33:05.3187519Z compiled: bool, 2025-05-07T20:33:05.3187604Z ) -> None: 2025-05-07T20:33:05.3187700Z torch.manual_seed(2025) 2025-05-07T20:33:05.3187772Z 2025-05-07T20:33:05.3187947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3188021Z 2025-05-07T20:33:05.3188113Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3188240Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3188327Z x = x_sign * x_clamp 2025-05-07T20:33:05.3188407Z x0 = x[:, :D] 2025-05-07T20:33:05.3188496Z x1 = x[:, D:] 2025-05-07T20:33:05.3188566Z 2025-05-07T20:33:05.3188650Z if contiguous: 2025-05-07T20:33:05.3188749Z x0 = x0.contiguous() 2025-05-07T20:33:05.3188837Z x1 = x1.contiguous() 2025-05-07T20:33:05.3188950Z 2025-05-07T20:33:05.3189045Z if scale_ub is not None: 2025-05-07T20:33:05.3189155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3189332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3189410Z ) 2025-05-07T20:33:05.3189488Z else: 2025-05-07T20:33:05.3189586Z scale_ub_tensor = None 2025-05-07T20:33:05.3189657Z 2025-05-07T20:33:05.3189787Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3189881Z op = silu_mul_quant 2025-05-07T20:33:05.3189965Z if compiled: 2025-05-07T20:33:05.3190064Z op = torch.compile(op) 2025-05-07T20:33:05.3190175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3190245Z 2025-05-07T20:33:05.3190336Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3190348Z 2025-05-07T20:33:05.3190447Z moe/activation_test.py:117: 2025-05-07T20:33:05.3190576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3190684Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3190788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3191304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3191406Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3191776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3192009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3192362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3192456Z kernel = self.compile( 2025-05-07T20:33:05.3192853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3193032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3193164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3193169Z 2025-05-07T20:33:05.3193380Z self = 2025-05-07T20:33:05.3194176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3194704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfbc31a0>} 2025-05-07T20:33:05.3195471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3195750Z context = 2025-05-07T20:33:05.3195792Z 2025-05-07T20:33:05.3195965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3196238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3196350Z module_map=module_map) 2025-05-07T20:33:05.3196514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3196613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3196696Z E ^ 2025-05-07T20:33:05.3197059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3197064Z 2025-05-07T20:33:05.3197494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3197501Z 2025-05-07T20:33:05.3197605Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3197881Z self=, 2025-05-07T20:33:05.3198002Z T=128, 2025-05-07T20:33:05.3198078Z D=7168, 2025-05-07T20:33:05.3198161Z scale_ub=None, 2025-05-07T20:33:05.3198250Z contiguous=True, 2025-05-07T20:33:05.3198333Z compiled=False, 2025-05-07T20:33:05.3198410Z ) 2025-05-07T20:33:05.3198631Z self = 2025-05-07T20:33:05.3198803Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3198807Z 2025-05-07T20:33:05.3198886Z @given( 2025-05-07T20:33:05.3199003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3199101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3199220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3199340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3199451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3199532Z ) 2025-05-07T20:33:05.3199791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3199885Z def test_silu_mul_quant( 2025-05-07T20:33:05.3199961Z self, 2025-05-07T20:33:05.3200035Z T: int, 2025-05-07T20:33:05.3200190Z D: int, 2025-05-07T20:33:05.3200293Z scale_ub: Optional[float], 2025-05-07T20:33:05.3200386Z contiguous: bool, 2025-05-07T20:33:05.3200477Z compiled: bool, 2025-05-07T20:33:05.3200553Z ) -> None: 2025-05-07T20:33:05.3200649Z torch.manual_seed(2025) 2025-05-07T20:33:05.3200723Z 2025-05-07T20:33:05.3200903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3200977Z 2025-05-07T20:33:05.3201072Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3201200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3201291Z x = x_sign * x_clamp 2025-05-07T20:33:05.3201374Z x0 = x[:, :D] 2025-05-07T20:33:05.3201456Z x1 = x[:, D:] 2025-05-07T20:33:05.3201530Z 2025-05-07T20:33:05.3201616Z if contiguous: 2025-05-07T20:33:05.3201706Z x0 = x0.contiguous() 2025-05-07T20:33:05.3201799Z x1 = x1.contiguous() 2025-05-07T20:33:05.3201873Z 2025-05-07T20:33:05.3201964Z if scale_ub is not None: 2025-05-07T20:33:05.3202083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3202221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3202296Z ) 2025-05-07T20:33:05.3202376Z else: 2025-05-07T20:33:05.3202470Z scale_ub_tensor = None 2025-05-07T20:33:05.3202550Z 2025-05-07T20:33:05.3202682Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3202773Z op = silu_mul_quant 2025-05-07T20:33:05.3202909Z if compiled: 2025-05-07T20:33:05.3203016Z op = torch.compile(op) 2025-05-07T20:33:05.3203123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3203236Z 2025-05-07T20:33:05.3203329Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3203334Z 2025-05-07T20:33:05.3203432Z moe/activation_test.py:117: 2025-05-07T20:33:05.3203561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3203660Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3203766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3204333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3204429Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3204804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3205038Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3205390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3205593Z kernel = self.compile( 2025-05-07T20:33:05.3205990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3206173Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3206301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3206305Z 2025-05-07T20:33:05.3206514Z self = 2025-05-07T20:33:05.3207326Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3207855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfaac040>} 2025-05-07T20:33:05.3208640Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3208839Z context = 2025-05-07T20:33:05.3208843Z 2025-05-07T20:33:05.3209012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3209290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3209398Z module_map=module_map) 2025-05-07T20:33:05.3209567Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3209665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3209744Z E ^ 2025-05-07T20:33:05.3210118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3210129Z 2025-05-07T20:33:05.3210561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3210566Z 2025-05-07T20:33:05.3210672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3210901Z self=, 2025-05-07T20:33:05.3210978Z T=2048, 2025-05-07T20:33:05.3211056Z D=7168, 2025-05-07T20:33:05.3211140Z scale_ub=1200.0, 2025-05-07T20:33:05.3211226Z contiguous=True, 2025-05-07T20:33:05.3211320Z compiled=False, 2025-05-07T20:33:05.3211395Z ) 2025-05-07T20:33:05.3211616Z self = 2025-05-07T20:33:05.3211797Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3211846Z 2025-05-07T20:33:05.3211923Z @given( 2025-05-07T20:33:05.3212048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3212187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3212305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3212426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3212542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3212615Z ) 2025-05-07T20:33:05.3212873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3212966Z def test_silu_mul_quant( 2025-05-07T20:33:05.3213040Z self, 2025-05-07T20:33:05.3213121Z T: int, 2025-05-07T20:33:05.3213195Z D: int, 2025-05-07T20:33:05.3213297Z scale_ub: Optional[float], 2025-05-07T20:33:05.3213700Z contiguous: bool, 2025-05-07T20:33:05.3213793Z compiled: bool, 2025-05-07T20:33:05.3213882Z ) -> None: 2025-05-07T20:33:05.3213978Z torch.manual_seed(2025) 2025-05-07T20:33:05.3214140Z 2025-05-07T20:33:05.3214319Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3216224Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3216231Z 2025-05-07T20:33:05.3216354Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3216359Z 2025-05-07T20:33:05.3216461Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3216693Z self=, 2025-05-07T20:33:05.3216775Z T=1, 2025-05-07T20:33:05.3216849Z D=5120, 2025-05-07T20:33:05.3216940Z scale_ub=1200.0, 2025-05-07T20:33:05.3217028Z contiguous=True, 2025-05-07T20:33:05.3217111Z compiled=False, 2025-05-07T20:33:05.3217187Z ) 2025-05-07T20:33:05.3217410Z self = 2025-05-07T20:33:05.3217579Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3217583Z 2025-05-07T20:33:05.3217663Z @given( 2025-05-07T20:33:05.3217781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3217880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3221320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3221466Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3221587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3221666Z ) 2025-05-07T20:33:05.3221926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3222028Z def test_silu_mul_quant( 2025-05-07T20:33:05.3222107Z self, 2025-05-07T20:33:05.3222191Z T: int, 2025-05-07T20:33:05.3222268Z D: int, 2025-05-07T20:33:05.3222366Z scale_ub: Optional[float], 2025-05-07T20:33:05.3222460Z contiguous: bool, 2025-05-07T20:33:05.3222547Z compiled: bool, 2025-05-07T20:33:05.3222631Z ) -> None: 2025-05-07T20:33:05.3222727Z torch.manual_seed(2025) 2025-05-07T20:33:05.3222798Z 2025-05-07T20:33:05.3222975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3223047Z 2025-05-07T20:33:05.3223139Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3223267Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3223357Z x = x_sign * x_clamp 2025-05-07T20:33:05.3223529Z x0 = x[:, :D] 2025-05-07T20:33:05.3223612Z x1 = x[:, D:] 2025-05-07T20:33:05.3223685Z 2025-05-07T20:33:05.3223768Z if contiguous: 2025-05-07T20:33:05.3223922Z x0 = x0.contiguous() 2025-05-07T20:33:05.3224019Z x1 = x1.contiguous() 2025-05-07T20:33:05.3224095Z 2025-05-07T20:33:05.3224211Z if scale_ub is not None: 2025-05-07T20:33:05.3224333Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3224488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3224562Z ) 2025-05-07T20:33:05.3224639Z else: 2025-05-07T20:33:05.3224744Z scale_ub_tensor = None 2025-05-07T20:33:05.3224814Z 2025-05-07T20:33:05.3224947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3225042Z op = silu_mul_quant 2025-05-07T20:33:05.3225128Z if compiled: 2025-05-07T20:33:05.3225227Z op = torch.compile(op) 2025-05-07T20:33:05.3225341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3225413Z 2025-05-07T20:33:05.3225555Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3225561Z 2025-05-07T20:33:05.3225661Z moe/activation_test.py:117: 2025-05-07T20:33:05.3225828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3225935Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3226035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3226555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3226655Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3227025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3227264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3227615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3227715Z kernel = self.compile( 2025-05-07T20:33:05.3228129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3228312Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3228441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3228448Z 2025-05-07T20:33:05.3228656Z self = 2025-05-07T20:33:05.3229456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3229983Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfaad580>} 2025-05-07T20:33:05.3230755Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3230960Z context = 2025-05-07T20:33:05.3230965Z 2025-05-07T20:33:05.3231131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3231402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3231516Z module_map=module_map) 2025-05-07T20:33:05.3231681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3231779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3231861Z E ^ 2025-05-07T20:33:05.3232223Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3232272Z 2025-05-07T20:33:05.3232744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3232752Z 2025-05-07T20:33:05.3232859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3233089Z self=, 2025-05-07T20:33:05.3233172Z T=2048, 2025-05-07T20:33:05.3233246Z D=5120, 2025-05-07T20:33:05.3233333Z scale_ub=None, 2025-05-07T20:33:05.3233419Z contiguous=True, 2025-05-07T20:33:05.3233502Z compiled=False, 2025-05-07T20:33:05.3233581Z ) 2025-05-07T20:33:05.3233804Z self = 2025-05-07T20:33:05.3233978Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3233983Z 2025-05-07T20:33:05.3234068Z @given( 2025-05-07T20:33:05.3234192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3234294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3234484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3234662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3234782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3234857Z ) 2025-05-07T20:33:05.3235111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3235208Z def test_silu_mul_quant( 2025-05-07T20:33:05.3235285Z self, 2025-05-07T20:33:05.3235359Z T: int, 2025-05-07T20:33:05.3235440Z D: int, 2025-05-07T20:33:05.3235538Z scale_ub: Optional[float], 2025-05-07T20:33:05.3235628Z contiguous: bool, 2025-05-07T20:33:05.3235717Z compiled: bool, 2025-05-07T20:33:05.3235795Z ) -> None: 2025-05-07T20:33:05.3235892Z torch.manual_seed(2025) 2025-05-07T20:33:05.3235969Z 2025-05-07T20:33:05.3236144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3236226Z 2025-05-07T20:33:05.3236319Z > x_sign = torch.sign(x) 2025-05-07T20:33:05.3238169Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3238180Z 2025-05-07T20:33:05.3238301Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:05.3238305Z 2025-05-07T20:33:05.3238412Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3238646Z self=, 2025-05-07T20:33:05.3238723Z T=16384, 2025-05-07T20:33:05.3238805Z D=5120, 2025-05-07T20:33:05.3238890Z scale_ub=None, 2025-05-07T20:33:05.3238981Z contiguous=True, 2025-05-07T20:33:05.3239068Z compiled=False, 2025-05-07T20:33:05.3239144Z ) 2025-05-07T20:33:05.3239366Z self = 2025-05-07T20:33:05.3239550Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3239555Z 2025-05-07T20:33:05.3239630Z @given( 2025-05-07T20:33:05.3239750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3239853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3239969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3240179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3240302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3240429Z ) 2025-05-07T20:33:05.3240681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3240782Z def test_silu_mul_quant( 2025-05-07T20:33:05.3240923Z self, 2025-05-07T20:33:05.3241006Z T: int, 2025-05-07T20:33:05.3241081Z D: int, 2025-05-07T20:33:05.3241180Z scale_ub: Optional[float], 2025-05-07T20:33:05.3241276Z contiguous: bool, 2025-05-07T20:33:05.3241360Z compiled: bool, 2025-05-07T20:33:05.3241437Z ) -> None: 2025-05-07T20:33:05.3241534Z torch.manual_seed(2025) 2025-05-07T20:33:05.3241607Z 2025-05-07T20:33:05.3241777Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3243661Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3243705Z 2025-05-07T20:33:05.3243826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3243831Z 2025-05-07T20:33:05.3243937Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3244163Z self=, 2025-05-07T20:33:05.3244242Z T=4096, 2025-05-07T20:33:05.3244317Z D=5120, 2025-05-07T20:33:05.3244399Z scale_ub=None, 2025-05-07T20:33:05.3244487Z contiguous=True, 2025-05-07T20:33:05.3244570Z compiled=False, 2025-05-07T20:33:05.3244644Z ) 2025-05-07T20:33:05.3244868Z self = 2025-05-07T20:33:05.3245045Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3245049Z 2025-05-07T20:33:05.3245127Z @given( 2025-05-07T20:33:05.3245253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3245354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3245466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3245586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3245701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3245776Z ) 2025-05-07T20:33:05.3246027Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3246122Z def test_silu_mul_quant( 2025-05-07T20:33:05.3246199Z self, 2025-05-07T20:33:05.3246273Z T: int, 2025-05-07T20:33:05.3246351Z D: int, 2025-05-07T20:33:05.3246450Z scale_ub: Optional[float], 2025-05-07T20:33:05.3246537Z contiguous: bool, 2025-05-07T20:33:05.3246623Z compiled: bool, 2025-05-07T20:33:05.3246705Z ) -> None: 2025-05-07T20:33:05.3246799Z torch.manual_seed(2025) 2025-05-07T20:33:05.3246873Z 2025-05-07T20:33:05.3247048Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3248876Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3248885Z 2025-05-07T20:33:05.3249001Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3249005Z 2025-05-07T20:33:05.3249155Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3249384Z self=, 2025-05-07T20:33:05.3249464Z T=2048, 2025-05-07T20:33:05.3249577Z D=5120, 2025-05-07T20:33:05.3249666Z scale_ub=None, 2025-05-07T20:33:05.3249752Z contiguous=False, 2025-05-07T20:33:05.3249834Z compiled=False, 2025-05-07T20:33:05.3249909Z ) 2025-05-07T20:33:05.3250128Z self = 2025-05-07T20:33:05.3250307Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3250311Z 2025-05-07T20:33:05.3250388Z @given( 2025-05-07T20:33:05.3250503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3250606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3250719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3250835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3250953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3251026Z ) 2025-05-07T20:33:05.3251318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3251454Z def test_silu_mul_quant( 2025-05-07T20:33:05.3251532Z self, 2025-05-07T20:33:05.3251609Z T: int, 2025-05-07T20:33:05.3251683Z D: int, 2025-05-07T20:33:05.3251779Z scale_ub: Optional[float], 2025-05-07T20:33:05.3251869Z contiguous: bool, 2025-05-07T20:33:05.3251954Z compiled: bool, 2025-05-07T20:33:05.3252030Z ) -> None: 2025-05-07T20:33:05.3252127Z torch.manual_seed(2025) 2025-05-07T20:33:05.3252198Z 2025-05-07T20:33:05.3252367Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3254199Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3254211Z 2025-05-07T20:33:05.3254329Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3254333Z 2025-05-07T20:33:05.3254440Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3254665Z self=, 2025-05-07T20:33:05.3254744Z T=4096, 2025-05-07T20:33:05.3254820Z D=7168, 2025-05-07T20:33:05.3254901Z scale_ub=None, 2025-05-07T20:33:05.3254990Z contiguous=True, 2025-05-07T20:33:05.3255073Z compiled=True, 2025-05-07T20:33:05.3255146Z ) 2025-05-07T20:33:05.3255369Z self = 2025-05-07T20:33:05.3255542Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.3255550Z 2025-05-07T20:33:05.3255629Z @given( 2025-05-07T20:33:05.3255758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3255856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3255968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3256087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3256200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3256276Z ) 2025-05-07T20:33:05.3256524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3256616Z def test_silu_mul_quant( 2025-05-07T20:33:05.3256692Z self, 2025-05-07T20:33:05.3256766Z T: int, 2025-05-07T20:33:05.3256841Z D: int, 2025-05-07T20:33:05.3256941Z scale_ub: Optional[float], 2025-05-07T20:33:05.3257079Z contiguous: bool, 2025-05-07T20:33:05.3257163Z compiled: bool, 2025-05-07T20:33:05.3257247Z ) -> None: 2025-05-07T20:33:05.3257341Z torch.manual_seed(2025) 2025-05-07T20:33:05.3257453Z 2025-05-07T20:33:05.3257628Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3259452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3259460Z 2025-05-07T20:33:05.3259579Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3259586Z 2025-05-07T20:33:05.3259688Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3259962Z self=, 2025-05-07T20:33:05.3260078Z T=2048, 2025-05-07T20:33:05.3260155Z D=5120, 2025-05-07T20:33:05.3260242Z scale_ub=1200.0, 2025-05-07T20:33:05.3260327Z contiguous=False, 2025-05-07T20:33:05.3260409Z compiled=False, 2025-05-07T20:33:05.3260489Z ) 2025-05-07T20:33:05.3260710Z self = 2025-05-07T20:33:05.3260891Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.3260895Z 2025-05-07T20:33:05.3260972Z @given( 2025-05-07T20:33:05.3261089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3261188Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3261300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3261417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3261530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3261606Z ) 2025-05-07T20:33:05.3261860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3261956Z def test_silu_mul_quant( 2025-05-07T20:33:05.3262033Z self, 2025-05-07T20:33:05.3262115Z T: int, 2025-05-07T20:33:05.3262190Z D: int, 2025-05-07T20:33:05.3262288Z scale_ub: Optional[float], 2025-05-07T20:33:05.3262380Z contiguous: bool, 2025-05-07T20:33:05.3262464Z compiled: bool, 2025-05-07T20:33:05.3262540Z ) -> None: 2025-05-07T20:33:05.3262636Z torch.manual_seed(2025) 2025-05-07T20:33:05.3262709Z 2025-05-07T20:33:05.3262883Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3264767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3264778Z 2025-05-07T20:33:05.3264898Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3264902Z 2025-05-07T20:33:05.3265006Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3265232Z self=, 2025-05-07T20:33:05.3265311Z T=4096, 2025-05-07T20:33:05.3265386Z D=7168, 2025-05-07T20:33:05.3265468Z scale_ub=1200.0, 2025-05-07T20:33:05.3265553Z contiguous=True, 2025-05-07T20:33:05.3265635Z compiled=False, 2025-05-07T20:33:05.3265752Z ) 2025-05-07T20:33:05.3265975Z self = 2025-05-07T20:33:05.3266189Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3266195Z 2025-05-07T20:33:05.3266273Z @given( 2025-05-07T20:33:05.3266394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3266491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3266604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3266723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3266835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3266908Z ) 2025-05-07T20:33:05.3267156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3267249Z def test_silu_mul_quant( 2025-05-07T20:33:05.3267328Z self, 2025-05-07T20:33:05.3267402Z T: int, 2025-05-07T20:33:05.3267478Z D: int, 2025-05-07T20:33:05.3267578Z scale_ub: Optional[float], 2025-05-07T20:33:05.3267666Z contiguous: bool, 2025-05-07T20:33:05.3267793Z compiled: bool, 2025-05-07T20:33:05.3267877Z ) -> None: 2025-05-07T20:33:05.3268009Z torch.manual_seed(2025) 2025-05-07T20:33:05.3268080Z 2025-05-07T20:33:05.3268252Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3270076Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3270090Z 2025-05-07T20:33:05.3270205Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3270214Z 2025-05-07T20:33:05.3270315Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3270548Z self=, 2025-05-07T20:33:05.3270624Z T=16384, 2025-05-07T20:33:05.3270702Z D=7168, 2025-05-07T20:33:05.3270786Z scale_ub=None, 2025-05-07T20:33:05.3270872Z contiguous=False, 2025-05-07T20:33:05.3270952Z compiled=True, 2025-05-07T20:33:05.3271026Z ) 2025-05-07T20:33:05.3271245Z self = 2025-05-07T20:33:05.3271424Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3271429Z 2025-05-07T20:33:05.3271503Z @given( 2025-05-07T20:33:05.3271620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3271722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3271837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3271952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3272072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3272146Z ) 2025-05-07T20:33:05.3272399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3272493Z def test_silu_mul_quant( 2025-05-07T20:33:05.3272569Z self, 2025-05-07T20:33:05.3272646Z T: int, 2025-05-07T20:33:05.3272720Z D: int, 2025-05-07T20:33:05.3272817Z scale_ub: Optional[float], 2025-05-07T20:33:05.3272909Z contiguous: bool, 2025-05-07T20:33:05.3272997Z compiled: bool, 2025-05-07T20:33:05.3273074Z ) -> None: 2025-05-07T20:33:05.3273171Z torch.manual_seed(2025) 2025-05-07T20:33:05.3273241Z 2025-05-07T20:33:05.3273410Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3275317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3275360Z 2025-05-07T20:33:05.3275481Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3275486Z 2025-05-07T20:33:05.3275587Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3275812Z self=, 2025-05-07T20:33:05.3275891Z T=4096, 2025-05-07T20:33:05.3275966Z D=7168, 2025-05-07T20:33:05.3276047Z scale_ub=None, 2025-05-07T20:33:05.3276137Z contiguous=True, 2025-05-07T20:33:05.3276220Z compiled=False, 2025-05-07T20:33:05.3276292Z ) 2025-05-07T20:33:05.3276558Z self = 2025-05-07T20:33:05.3276769Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3276774Z 2025-05-07T20:33:05.3276853Z @given( 2025-05-07T20:33:05.3276971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3277067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3277181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3277298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3277410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3277484Z ) 2025-05-07T20:33:05.3277734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3277828Z def test_silu_mul_quant( 2025-05-07T20:33:05.3277905Z self, 2025-05-07T20:33:05.3277982Z T: int, 2025-05-07T20:33:05.3278058Z D: int, 2025-05-07T20:33:05.3278157Z scale_ub: Optional[float], 2025-05-07T20:33:05.3278244Z contiguous: bool, 2025-05-07T20:33:05.3278334Z compiled: bool, 2025-05-07T20:33:05.3278413Z ) -> None: 2025-05-07T20:33:05.3278506Z torch.manual_seed(2025) 2025-05-07T20:33:05.3278579Z 2025-05-07T20:33:05.3278748Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3280666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
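The PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint printed with every failure targets fragmentation. In these messages only about 19 MiB is "reserved by PyTorch but unallocated", so fragmentation is a minor factor here, but the knob is cheap to try. It is read when the CUDA caching allocator initializes, so it must be in the environment before the process first touches the GPU; a sketch, assuming a fresh Python process:

import os

# Must be set before the first CUDA allocation in this process;
# exporting it afterwards has no effect on the already-built allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the variable is set

x = torch.randn(1024, device="cuda")  # allocator now uses expandable segments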
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3280684Z 2025-05-07T20:33:05.3280802Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3280809Z 2025-05-07T20:33:05.3280912Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3281142Z self=, 2025-05-07T20:33:05.3281218Z T=16384, 2025-05-07T20:33:05.3281299Z D=7168, 2025-05-07T20:33:05.3281381Z scale_ub=None, 2025-05-07T20:33:05.3281465Z contiguous=True, 2025-05-07T20:33:05.3281549Z compiled=False, 2025-05-07T20:33:05.3281621Z ) 2025-05-07T20:33:05.3281840Z self = 2025-05-07T20:33:05.3282022Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3282026Z 2025-05-07T20:33:05.3282101Z @given( 2025-05-07T20:33:05.3282216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3282367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3282478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3282636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3282751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3282824Z ) 2025-05-07T20:33:05.3283081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3283173Z def test_silu_mul_quant( 2025-05-07T20:33:05.3283248Z self, 2025-05-07T20:33:05.3283327Z T: int, 2025-05-07T20:33:05.3283401Z D: int, 2025-05-07T20:33:05.3283498Z scale_ub: Optional[float], 2025-05-07T20:33:05.3283592Z contiguous: bool, 2025-05-07T20:33:05.3283678Z compiled: bool, 2025-05-07T20:33:05.3283757Z ) -> None: 2025-05-07T20:33:05.3283849Z torch.manual_seed(2025) 2025-05-07T20:33:05.3283924Z 2025-05-07T20:33:05.3284091Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3285960Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3286006Z 2025-05-07T20:33:05.3286125Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3286129Z 2025-05-07T20:33:05.3286231Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3286458Z self=, 2025-05-07T20:33:05.3286534Z T=16384, 2025-05-07T20:33:05.3286616Z D=7168, 2025-05-07T20:33:05.3286702Z scale_ub=1200.0, 2025-05-07T20:33:05.3286784Z contiguous=True, 2025-05-07T20:33:05.3286870Z compiled=False, 2025-05-07T20:33:05.3286943Z ) 2025-05-07T20:33:05.3287166Z self = 2025-05-07T20:33:05.3287346Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3287350Z 2025-05-07T20:33:05.3287426Z @given( 2025-05-07T20:33:05.3287542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3287642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3287754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3287868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3287983Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3288055Z ) 2025-05-07T20:33:05.3288303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3288401Z def test_silu_mul_quant( 2025-05-07T20:33:05.3288475Z self, 2025-05-07T20:33:05.3288556Z T: int, 2025-05-07T20:33:05.3288629Z D: int, 2025-05-07T20:33:05.3288727Z scale_ub: Optional[float], 2025-05-07T20:33:05.3288818Z contiguous: bool, 2025-05-07T20:33:05.3288901Z compiled: bool, 2025-05-07T20:33:05.3288976Z ) -> None: 2025-05-07T20:33:05.3289073Z torch.manual_seed(2025) 2025-05-07T20:33:05.3289144Z 2025-05-07T20:33:05.3289314Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3291145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
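The "Tried to allocate" sizes are exactly what the test's first line implies: x has shape [T, 2*D] in bfloat16, i.e. T * 2D elements at 2 bytes each. A quick check against the figures in this log (the T=2048 case appears a little further down):

def randn_bytes(T: int, D: int, bytes_per_elem: int = 2) -> int:
    # Size of torch.randn([T, 2 * D], dtype=torch.bfloat16) in bytes.
    return T * (2 * D) * bytes_per_elem

assert randn_bytes(4096, 7168) == 112 * 1024**2   # 112.00 MiB
assert randn_bytes(16384, 7168) == 448 * 1024**2  # 448.00 MiB
assert randn_bytes(2048, 7168) == 56 * 1024**2    # 56.00 MiB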
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3291200Z 2025-05-07T20:33:05.3291355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3291360Z 2025-05-07T20:33:05.3291465Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3291690Z self=, 2025-05-07T20:33:05.3291769Z T=128, 2025-05-07T20:33:05.3291845Z D=5120, 2025-05-07T20:33:05.3291926Z scale_ub=1200.0, 2025-05-07T20:33:05.3292012Z contiguous=False, 2025-05-07T20:33:05.3292095Z compiled=False, 2025-05-07T20:33:05.3292166Z ) 2025-05-07T20:33:05.3292387Z self = 2025-05-07T20:33:05.3292562Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.3292566Z 2025-05-07T20:33:05.3292643Z @given( 2025-05-07T20:33:05.3292765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3292862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3293023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3293174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3293287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3293362Z ) 2025-05-07T20:33:05.3293611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3293702Z def test_silu_mul_quant( 2025-05-07T20:33:05.3293781Z self, 2025-05-07T20:33:05.3293856Z T: int, 2025-05-07T20:33:05.3293931Z D: int, 2025-05-07T20:33:05.3294031Z scale_ub: Optional[float], 2025-05-07T20:33:05.3294144Z contiguous: bool, 2025-05-07T20:33:05.3294233Z compiled: bool, 2025-05-07T20:33:05.3294330Z ) -> None: 2025-05-07T20:33:05.3294426Z torch.manual_seed(2025) 2025-05-07T20:33:05.3294504Z 2025-05-07T20:33:05.3294673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3294748Z 2025-05-07T20:33:05.3294844Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3294975Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3295063Z x = x_sign * x_clamp 2025-05-07T20:33:05.3295144Z x0 = x[:, :D] 2025-05-07T20:33:05.3295223Z x1 = x[:, D:] 2025-05-07T20:33:05.3295294Z 2025-05-07T20:33:05.3295380Z if contiguous: 2025-05-07T20:33:05.3295469Z x0 = x0.contiguous() 2025-05-07T20:33:05.3295558Z x1 = x1.contiguous() 2025-05-07T20:33:05.3295631Z 2025-05-07T20:33:05.3295719Z if scale_ub is not None: 2025-05-07T20:33:05.3295824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3295962Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3296036Z ) 2025-05-07T20:33:05.3296115Z else: 2025-05-07T20:33:05.3296211Z scale_ub_tensor = None 2025-05-07T20:33:05.3296280Z 2025-05-07T20:33:05.3296416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3296507Z op = silu_mul_quant 2025-05-07T20:33:05.3296592Z if compiled: 2025-05-07T20:33:05.3296694Z op = torch.compile(op) 2025-05-07T20:33:05.3296798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3296868Z 2025-05-07T20:33:05.3296961Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3296965Z 2025-05-07T20:33:05.3297063Z moe/activation_test.py:117: 2025-05-07T20:33:05.3297195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3297294Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3297392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3297907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3298050Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3298417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:05.3298693Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3299046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3299144Z kernel = self.compile( 2025-05-07T20:33:05.3299534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3299711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3299840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3299845Z 2025-05-07T20:33:05.3300054Z self = 2025-05-07T20:33:05.3300857Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3301477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf7e11c0>} 2025-05-07T20:33:05.3302249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3302447Z context = 2025-05-07T20:33:05.3302451Z 2025-05-07T20:33:05.3302617Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3302891Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3303000Z module_map=module_map) 2025-05-07T20:33:05.3303166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3303270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3303347Z E ^ 2025-05-07T20:33:05.3303708Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3303717Z 2025-05-07T20:33:05.3304141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3304146Z 2025-05-07T20:33:05.3304247Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3304477Z self=, 2025-05-07T20:33:05.3304554Z T=2048, 2025-05-07T20:33:05.3304628Z D=7168, 2025-05-07T20:33:05.3304714Z scale_ub=None, 2025-05-07T20:33:05.3304799Z contiguous=False, 2025-05-07T20:33:05.3304884Z compiled=False, 2025-05-07T20:33:05.3304961Z ) 2025-05-07T20:33:05.3305181Z self = 2025-05-07T20:33:05.3305369Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3305373Z 2025-05-07T20:33:05.3305450Z @given( 2025-05-07T20:33:05.3305569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3305668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3305781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3305896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3306012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3306084Z ) 2025-05-07T20:33:05.3306337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3306429Z def test_silu_mul_quant( 2025-05-07T20:33:05.3306503Z self, 2025-05-07T20:33:05.3306627Z T: int, 2025-05-07T20:33:05.3306702Z D: int, 2025-05-07T20:33:05.3306798Z scale_ub: Optional[float], 2025-05-07T20:33:05.3306891Z contiguous: bool, 2025-05-07T20:33:05.3307012Z compiled: bool, 2025-05-07T20:33:05.3307090Z ) -> None: 2025-05-07T20:33:05.3307191Z torch.manual_seed(2025) 2025-05-07T20:33:05.3307261Z 2025-05-07T20:33:05.3307431Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3309266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
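Interleaved with the OOMs is a different failure entirely: the triton.compiler.errors.CompilationError above. The traceback shows silu_mul_quant launching the Triton kernel _fbgemm_silu_mul_quant, which uses the fp8e4nv dtype (PyTorch's float8_e4m3fn). Triton only emits that conversion on NVIDIA parts with compute capability 8.9 or newer, while the A10G typically behind this g5-class runner is SM 8.6 and only offers the fp8e4b15/fp8e5 variants named in the error. A sketch of the usual guard for such tests; the 8.9 threshold is an assumption based on Triton's fp8e4nv support matrix, and requires_fp8 is a hypothetical marker name:

import pytest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) conversions need SM 8.9+ (Ada/Hopper);
    # the SM 8.6 A10G in this log fails Triton compilation instead.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support"
)

With such a marker on test_silu_mul_quant, this runner would report a skip rather than a hard failure.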
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3309274Z 2025-05-07T20:33:05.3309392Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3309437Z 2025-05-07T20:33:05.3309544Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3309808Z self=, 2025-05-07T20:33:05.3309888Z T=128, 2025-05-07T20:33:05.3309964Z D=7168, 2025-05-07T20:33:05.3310048Z scale_ub=1200.0, 2025-05-07T20:33:05.3310135Z contiguous=True, 2025-05-07T20:33:05.3310218Z compiled=True, 2025-05-07T20:33:05.3310288Z ) 2025-05-07T20:33:05.3310512Z self = 2025-05-07T20:33:05.3310681Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3310686Z 2025-05-07T20:33:05.3310760Z @given( 2025-05-07T20:33:05.3310881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3310979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3311098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3311215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3311329Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3311408Z ) 2025-05-07T20:33:05.3311659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3311751Z def test_silu_mul_quant( 2025-05-07T20:33:05.3311828Z self, 2025-05-07T20:33:05.3311903Z T: int, 2025-05-07T20:33:05.3311978Z D: int, 2025-05-07T20:33:05.3312079Z scale_ub: Optional[float], 2025-05-07T20:33:05.3312167Z contiguous: bool, 2025-05-07T20:33:05.3312251Z compiled: bool, 2025-05-07T20:33:05.3312329Z ) -> None: 2025-05-07T20:33:05.3312421Z torch.manual_seed(2025) 2025-05-07T20:33:05.3312497Z 2025-05-07T20:33:05.3312664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3312739Z 2025-05-07T20:33:05.3312833Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3312957Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3313051Z x = x_sign * x_clamp 2025-05-07T20:33:05.3313137Z x0 = x[:, :D] 2025-05-07T20:33:05.3313218Z x1 = x[:, D:] 2025-05-07T20:33:05.3313289Z 2025-05-07T20:33:05.3313900Z if contiguous: 2025-05-07T20:33:05.3314012Z x0 = x0.contiguous() 2025-05-07T20:33:05.3314100Z x1 = x1.contiguous() 2025-05-07T20:33:05.3314173Z 2025-05-07T20:33:05.3314264Z if scale_ub is not None: 2025-05-07T20:33:05.3314375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3314516Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3314590Z ) 2025-05-07T20:33:05.3314668Z else: 2025-05-07T20:33:05.3314763Z scale_ub_tensor = None 2025-05-07T20:33:05.3314833Z 2025-05-07T20:33:05.3314969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3315194Z op = silu_mul_quant 2025-05-07T20:33:05.3315283Z if compiled: 2025-05-07T20:33:05.3315445Z op = torch.compile(op) 2025-05-07T20:33:05.3315554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3315626Z 2025-05-07T20:33:05.3315724Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3315730Z 2025-05-07T20:33:05.3315827Z moe/activation_test.py:117: 2025-05-07T20:33:05.3315960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3316060Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3316160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3316544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3316637Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3317142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3317247Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3317733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:05.3317966Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3318314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3318407Z kernel = self.compile( 2025-05-07T20:33:05.3318802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3318979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3319105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3319113Z 2025-05-07T20:33:05.3319320Z self = 2025-05-07T20:33:05.3320201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3320732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf9f7b00>} 2025-05-07T20:33:05.3321500Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3321698Z context = 2025-05-07T20:33:05.3321702Z 2025-05-07T20:33:05.3321870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3322143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3322258Z module_map=module_map) 2025-05-07T20:33:05.3322423Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3322526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3322601Z E ^ 2025-05-07T20:33:05.3322963Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3322968Z 2025-05-07T20:33:05.3323394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3323399Z 2025-05-07T20:33:05.3323501Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3323727Z self=, 2025-05-07T20:33:05.3323805Z T=128, 2025-05-07T20:33:05.3323879Z D=7168, 2025-05-07T20:33:05.3323964Z scale_ub=1200.0, 2025-05-07T20:33:05.3324122Z contiguous=True, 2025-05-07T20:33:05.3324204Z compiled=False, 2025-05-07T20:33:05.3324281Z ) 2025-05-07T20:33:05.3324540Z self = 2025-05-07T20:33:05.3324718Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3324723Z 2025-05-07T20:33:05.3324801Z @given( 2025-05-07T20:33:05.3324919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3325020Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3325138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3325252Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3325367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3325439Z ) 2025-05-07T20:33:05.3325690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3325787Z def test_silu_mul_quant( 2025-05-07T20:33:05.3325867Z self, 2025-05-07T20:33:05.3325940Z T: int, 2025-05-07T20:33:05.3326018Z D: int, 2025-05-07T20:33:05.3326166Z scale_ub: Optional[float], 2025-05-07T20:33:05.3326256Z contiguous: bool, 2025-05-07T20:33:05.3326385Z compiled: bool, 2025-05-07T20:33:05.3326463Z ) -> None: 2025-05-07T20:33:05.3326556Z torch.manual_seed(2025) 2025-05-07T20:33:05.3326630Z 2025-05-07T20:33:05.3326799Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3326875Z 2025-05-07T20:33:05.3326964Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3327087Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3328932Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3328944Z 2025-05-07T20:33:05.3329062Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3329066Z 2025-05-07T20:33:05.3329171Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3329397Z self=, 2025-05-07T20:33:05.3329473Z T=128, 2025-05-07T20:33:05.3329550Z D=5120, 2025-05-07T20:33:05.3329632Z scale_ub=1200.0, 2025-05-07T20:33:05.3329715Z contiguous=True, 2025-05-07T20:33:05.3329800Z compiled=True, 2025-05-07T20:33:05.3329872Z ) 2025-05-07T20:33:05.3330096Z self = 2025-05-07T20:33:05.3330265Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3330272Z 2025-05-07T20:33:05.3330346Z @given( 2025-05-07T20:33:05.3330470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3330572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3330686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3330806Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3330917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3330988Z ) 2025-05-07T20:33:05.3331238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3331330Z def test_silu_mul_quant( 2025-05-07T20:33:05.3331406Z self, 2025-05-07T20:33:05.3331482Z T: int, 2025-05-07T20:33:05.3331556Z D: int, 2025-05-07T20:33:05.3331659Z scale_ub: Optional[float], 2025-05-07T20:33:05.3331747Z contiguous: bool, 2025-05-07T20:33:05.3331830Z compiled: bool, 2025-05-07T20:33:05.3331957Z ) -> None: 2025-05-07T20:33:05.3332051Z torch.manual_seed(2025) 2025-05-07T20:33:05.3332126Z 2025-05-07T20:33:05.3332338Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3332415Z 2025-05-07T20:33:05.3332507Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3332633Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3334451Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
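These last two examples fail one line later, at torch.clamp: with 22.06 GiB already in use, even a 20 MiB temporary cannot be carved out. Note also that the setup briefly multiplies its own footprint: x, x_sign, the torch.abs(x) intermediate, x_clamp, and the x_sign * x_clamp product all materialize as full [T, 2*D] buffers, several of them alive at once. A sketch of the same sign-preserving clamp done in place, assuming the original random values are not needed afterwards (in this test they are not):

import torch

def sign_preserving_clamp_(x: torch.Tensor, lo: float = 0.01, hi: float = 2.0) -> torch.Tensor:
    # In-place equivalent of: torch.sign(x) * torch.clamp(torch.abs(x), lo, hi).
    # Peak usage is one extra buffer (the sign), not three or four.
    sign = torch.sign(x)
    x.abs_().clamp_(lo, hi).mul_(sign)
    return x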
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3334459Z 2025-05-07T20:33:05.3334581Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3334624Z 2025-05-07T20:33:05.3334727Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3334993Z self=, 2025-05-07T20:33:05.3335074Z T=128, 2025-05-07T20:33:05.3335149Z D=7168, 2025-05-07T20:33:05.3335232Z scale_ub=None, 2025-05-07T20:33:05.3335318Z contiguous=True, 2025-05-07T20:33:05.3335400Z compiled=True, 2025-05-07T20:33:05.3335476Z ) 2025-05-07T20:33:05.3335696Z self = 2025-05-07T20:33:05.3335864Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.3335871Z 2025-05-07T20:33:05.3335946Z @given( 2025-05-07T20:33:05.3336064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3336167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3336282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3336398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3336515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3336591Z ) 2025-05-07T20:33:05.3336844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3336940Z def test_silu_mul_quant( 2025-05-07T20:33:05.3337014Z self, 2025-05-07T20:33:05.3337089Z T: int, 2025-05-07T20:33:05.3337166Z D: int, 2025-05-07T20:33:05.3337263Z scale_ub: Optional[float], 2025-05-07T20:33:05.3337355Z contiguous: bool, 2025-05-07T20:33:05.3337439Z compiled: bool, 2025-05-07T20:33:05.3337515Z ) -> None: 2025-05-07T20:33:05.3337610Z torch.manual_seed(2025) 2025-05-07T20:33:05.3337680Z 2025-05-07T20:33:05.3337849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3339680Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3339691Z 2025-05-07T20:33:05.3339808Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3339946Z =============================== warnings summary =============================== 2025-05-07T20:33:05.3340260Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:05.3340570Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:05.3340925Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:05.3341870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:05.3342108Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:05.3342112Z 2025-05-07T20:33:05.3342326Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:05.3342497Z ================= 1 failed, 1 deselected, 3 warnings in 12.08s ================= 2025-05-07T20:33:06.9302879Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:06.9930745Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:06.9930998Z 2025-05-07T20:33:06.9931479Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:06.9932171Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:06.9932609Z 2025-05-07T20:33:06.9932613Z 2025-05-07T20:33:06.9932617Z 2025-05-07T20:33:06.9948006Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:07.0029463Z Post job cleanup. 2025-05-07T20:33:07.1011702Z [command]/usr/bin/git version 2025-05-07T20:33:07.1055358Z git version 2.47.1 2025-05-07T20:33:07.1093527Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/558b33c6-0bf6-4269-b53e-73cbc3faf9f6/.gitconfig' 2025-05-07T20:33:07.1103994Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/558b33c6-0bf6-4269-b53e-73cbc3faf9f6' before making global git config changes 2025-05-07T20:33:07.1104911Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:07.1122136Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:07.1168299Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:07.1203249Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:07.1537911Z Entering 'external/asmjit' 2025-05-07T20:33:07.1605774Z Entering 'external/composable_kernel' 2025-05-07T20:33:07.1678712Z Entering 'external/cpuinfo' 2025-05-07T20:33:07.1746011Z Entering 'external/cutlass' 2025-05-07T20:33:07.1823081Z Entering 'external/googletest' 2025-05-07T20:33:07.1889815Z Entering 'external/hipify_torch' 2025-05-07T20:33:07.1957067Z Entering 'external/json' 2025-05-07T20:33:07.2042541Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:07.2067767Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2082374Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:07.2115151Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:07.2446447Z Entering 'external/asmjit' 2025-05-07T20:33:07.2489429Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2532463Z Entering 'external/composable_kernel' 2025-05-07T20:33:07.2574780Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2624909Z Entering 'external/cpuinfo' 2025-05-07T20:33:07.2667303Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2710138Z Entering 'external/cutlass' 2025-05-07T20:33:07.2754777Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2805465Z 
Entering 'external/googletest' 2025-05-07T20:33:07.2848725Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2890798Z Entering 'external/hipify_torch' 2025-05-07T20:33:07.2934574Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2978558Z Entering 'external/json' 2025-05-07T20:33:07.3021625Z http.https://github.com/.extraheader 2025-05-07T20:33:07.3177212Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:07.3211313Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:07.3222811Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:07.3223191Z ##[endgroup] 2025-05-07T20:33:07.3322756Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:18.1192327Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:35.4262467Z Cleaning up orphan processes
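Taken together, the log shows two distinct problems on this runner: a GPU without fp8e4nv support failing Triton compilation, and cascading CUDA OOMs across Hypothesis examples; the swap in/out alerts from after_job.sh are consistent with heavy host memory pressure during the same window. The failing invocation is recorded verbatim above and can be replayed locally for debugging:

conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py

Because of --lf --last-failed-no-failures none, this reruns only the tests pytest's cache remembers as failing, so a first full run of moe/activation_test.py may be needed to populate .pytest_cache.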