2025-05-07T20:23:26.0508890Z Current runner version: '2.323.0'
2025-05-07T20:23:26.0515181Z Runner name: 'i-0bb11f79b54aad6c7'
2025-05-07T20:23:26.0516196Z Machine name: 'ip-10-0-16-208'
2025-05-07T20:23:26.0518887Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.0521158Z Contents: read
2025-05-07T20:23:26.0521668Z Metadata: read
2025-05-07T20:23:26.0522151Z Packages: read
2025-05-07T20:23:26.0522640Z ##[endgroup]
2025-05-07T20:23:26.0524496Z Secret source: None
2025-05-07T20:23:26.0525123Z Prepare workflow directory
2025-05-07T20:23:26.1453587Z Prepare all required actions
2025-05-07T20:23:26.1494009Z Getting action download info
2025-05-07T20:23:26.3905868Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.6496898Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:27.0214077Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.6102416Z Getting action download info
2025-05-07T20:23:28.7232991Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:28.9660430Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.8.0, 12.6.3, gcc)
2025-05-07T20:23:29.0166054Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:29.0273012Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:29.0284522Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:29.0285178Z ##[endgroup]
2025-05-07T20:23:30.2907887Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.2908331Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.2908580Z AMI Name: unknown
2025-05-07T20:23:30.2950070Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.6403906Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.6404218Z with:
2025-05-07T20:23:35.6404444Z   submodules: true
2025-05-07T20:23:35.6404684Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.6405074Z   token: ***
2025-05-07T20:23:35.6405280Z   ssh-strict: true
2025-05-07T20:23:35.6405498Z   ssh-user: git
2025-05-07T20:23:35.6405721Z   persist-credentials: true
2025-05-07T20:23:35.6405978Z   clean: true
2025-05-07T20:23:35.6406211Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.6406481Z   fetch-depth: 1
2025-05-07T20:23:35.6406698Z   fetch-tags: false
2025-05-07T20:23:35.6406916Z   show-progress: true
2025-05-07T20:23:35.6407141Z   lfs: false
2025-05-07T20:23:35.6407354Z   set-safe-directory: true
2025-05-07T20:23:35.6407607Z env:
2025-05-07T20:23:35.6407822Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.6408132Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.6408394Z   BUILD_TARGET: genai
2025-05-07T20:23:35.6408663Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.6408938Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:35.6409191Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.6409439Z ##[endgroup]
2025-05-07T20:23:35.7567880Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:35.7569082Z ##[group]Getting Git version info
2025-05-07T20:23:35.7569567Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:35.7570248Z [command]/usr/bin/git version
2025-05-07T20:23:35.7570535Z git version 2.47.1
2025-05-07T20:23:35.7574515Z ##[endgroup]
2025-05-07T20:23:35.7588338Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c149a0cb-4ab3-48f4-9c0f-4470b857a01b' before making global git config changes
2025-05-07T20:23:35.7589391Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:35.7602446Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:35.7643500Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:35.7668153Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:35.7686195Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:35.7690956Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:35.7715824Z refs/heads/main
2025-05-07T20:23:35.7725246Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.6350669Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.6402955Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.6430738Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.6436254Z ##[endgroup]
2025-05-07T20:23:36.6440186Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.6862698Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:36.6949609Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:36.7037535Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:36.7123645Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:36.7208909Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:36.7294894Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:36.7377423Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:36.7391487Z ##[group]Cleaning the repository
2025-05-07T20:23:36.7395818Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:36.7453762Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:36.7563293Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.7570840Z ##[endgroup]
2025-05-07T20:23:36.7572893Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:36.7577426Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:36.7607512Z ##[endgroup]
2025-05-07T20:23:36.7607891Z ##[group]Setting up auth
2025-05-07T20:23:36.7624280Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:36.7654370Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:36.7986794Z Entering 'external/asmjit'
2025-05-07T20:23:36.8052849Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.8126729Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.8192091Z Entering 'external/cutlass'
2025-05-07T20:23:36.8266320Z Entering 'external/googletest'
2025-05-07T20:23:36.8331303Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.8396211Z Entering 'external/json'
2025-05-07T20:23:36.8481727Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:36.8512598Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:36.8847187Z Entering 'external/asmjit'
2025-05-07T20:23:36.8913486Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.8986074Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.9051508Z Entering 'external/cutlass'
2025-05-07T20:23:36.9127020Z Entering 'external/googletest'
2025-05-07T20:23:36.9192433Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.9257482Z Entering 'external/json'
2025-05-07T20:23:36.9344780Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:36.9396211Z ##[endgroup]
2025-05-07T20:23:36.9396605Z ##[group]Fetching the repository
2025-05-07T20:23:36.9403529Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:37.1124177Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:37.1124818Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:37.1150981Z ##[endgroup]
2025-05-07T20:23:37.1151545Z ##[group]Determining the checkout info
2025-05-07T20:23:37.1152737Z ##[endgroup]
2025-05-07T20:23:37.1157406Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:37.1208607Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:37.1238245Z ##[group]Checking out the ref
2025-05-07T20:23:37.1243039Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:37.1368003Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.1372219Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:37.1382083Z ##[endgroup]
2025-05-07T20:23:37.1382676Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:37.1388852Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.1439505Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:37.1471627Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:37.1503712Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:37.1532549Z ##[endgroup]
2025-05-07T20:23:37.1533183Z ##[group]Fetching submodules
2025-05-07T20:23:37.1536700Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.1910776Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.1911242Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.1911665Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.1912046Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.1913878Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.1914301Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.1914692Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.1928456Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.2356145Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.2503947Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.2605698Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.2776233Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.2866514Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.2952329Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.3059433Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.3076898Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.3406582Z Entering 'external/asmjit'
2025-05-07T20:23:37.3438904Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.3472945Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.3505600Z Entering 'external/cutlass'
2025-05-07T20:23:37.3536947Z Entering 'external/googletest'
2025-05-07T20:23:37.3569114Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.3601914Z Entering 'external/json'
2025-05-07T20:23:37.3647364Z ##[endgroup]
2025-05-07T20:23:37.3647790Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.3653391Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.3984081Z Entering 'external/asmjit'
2025-05-07T20:23:37.4025924Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4026695Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4069917Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.4112583Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4113020Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4161831Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.4204440Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4204877Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4247317Z Entering 'external/cutlass'
2025-05-07T20:23:37.4290507Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4291223Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4342517Z Entering 'external/googletest'
2025-05-07T20:23:37.4386316Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4386758Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4429472Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.4472593Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4473016Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4514921Z Entering 'external/json'
2025-05-07T20:23:37.4557856Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4558305Z url.https://github.com/.insteadof
2025-05-07T20:23:37.4618662Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.4946867Z Entering 'external/asmjit'
2025-05-07T20:23:37.5008267Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.5011034Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.5072540Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.5075257Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.5136615Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.5139350Z Entering 'external/cutlass'
2025-05-07T20:23:37.5202189Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.5204868Z Entering 'external/googletest'
2025-05-07T20:23:37.5266367Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.5269713Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.5331130Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.5334018Z Entering 'external/json'
2025-05-07T20:23:37.5395227Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.5528364Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.5857003Z Entering 'external/asmjit'
2025-05-07T20:23:37.5889940Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.5922598Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.5954908Z Entering 'external/cutlass'
2025-05-07T20:23:37.5989235Z Entering 'external/googletest'
2025-05-07T20:23:37.6021011Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.6054080Z Entering 'external/json'
2025-05-07T20:23:37.6101382Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.6429716Z Entering 'external/asmjit'
2025-05-07T20:23:37.6462346Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6498153Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6529966Z Entering 'external/cutlass'
2025-05-07T20:23:37.6561999Z Entering 'external/googletest'
2025-05-07T20:23:37.6593820Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.6625471Z Entering 'external/json'
2025-05-07T20:23:37.6671510Z ##[endgroup]
2025-05-07T20:23:37.6713072Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:37.6740127Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:37.6918596Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:37.6918915Z with:
2025-05-07T20:23:37.6919159Z   name: fbgemm_genai_x86_gcc_py3.12_cu12.8.0.whl
2025-05-07T20:23:37.6919474Z   merge-multiple: false
2025-05-07T20:23:37.6919736Z   repository: pytorch/FBGEMM
2025-05-07T20:23:37.6919995Z   run-id: 14891846252
2025-05-07T20:23:37.6920229Z env:
2025-05-07T20:23:37.6920490Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:37.6920786Z   BUILD_ENV: build_binary
2025-05-07T20:23:37.6921032Z   BUILD_TARGET: genai
2025-05-07T20:23:37.6921255Z   BUILD_VARIANT: cuda
2025-05-07T20:23:37.6921491Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:37.6921741Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:37.6921984Z ##[endgroup]
2025-05-07T20:23:37.9255836Z Downloading single artifact
2025-05-07T20:23:38.0234489Z Preparing to download the following artifacts:
2025-05-07T20:23:38.0235316Z - fbgemm_genai_x86_gcc_py3.12_cu12.8.0.whl (ID: 3081407199, Size: 18498190, Expected Digest: sha256:44a8371d786eb18d4cfaf0c12983918cf9c0bfea6fa4b0e46e2bab9751f50039)
2025-05-07T20:23:38.1091147Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-ece139bb-06c0-5836-80f5-9819333cc7e6/artifacts/32c0b958496f27864187ef499761b3d1022dfdf4e072683d135f40e372c7bc42.zip
2025-05-07T20:23:38.1093394Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.1736986Z (node:65593) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.1737961Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.4408684Z SHA256 digest of downloaded artifact is 44a8371d786eb18d4cfaf0c12983918cf9c0bfea6fa4b0e46e2bab9751f50039
2025-05-07T20:23:38.4409269Z Artifact download completed successfully.
2025-05-07T20:23:38.4409638Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.4414873Z Download artifact has finished successfully
2025-05-07T20:23:38.4669830Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.4670225Z with:
2025-05-07T20:23:38.4670453Z   driver-version: 570.133.07
2025-05-07T20:23:38.4670709Z env:
2025-05-07T20:23:38.4670938Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.4671240Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.4671494Z   BUILD_TARGET: genai
2025-05-07T20:23:38.4671729Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.4671965Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.4672227Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.4672465Z ##[endgroup]
2025-05-07T20:23:38.4766456Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.4766846Z with:
2025-05-07T20:23:38.4767054Z   timeout_minutes: 10
2025-05-07T20:23:38.4767405Z   max_attempts: 3
2025-05-07T20:23:38.4790492Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
  # But is it the best way to make it so that this action relies on nothing else? Absolutely
  set -eou pipefail

  DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
  DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

  install_nvidia_docker2_amzn2() {
    (
      set -x
      # Needed for yum-config-manager
      sudo yum install -y yum-utils
      if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
        YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
      else
        # Amazon Linux 2
        YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
      fi
      sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
      sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
      sudo systemctl restart docker
    )
  }

  install_nvidia_docker2_ubuntu20() {
    (
      set -x
      # Install nvidia-driver package if not installed
      status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
      if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
        sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      fi
    )
  }

  pre_install_nvidia_driver_amzn2() {
    (
      # Purge any nvidia driver installed from RHEL repo
      sudo yum remove -y nvidia-driver-latest-dkms
    )
  }

  install_nvidia_driver_common() {
    (
      # Try to gather more information about the runner and its existing NVIDIA driver if any
      echo "Before installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true

      HAS_NVIDIA_DRIVER=0
      # Check if NVIDIA driver has already been installed
      if [ -x "$(command -v nvidia-smi)" ]; then
        set +e
        # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
        # so that the same driver version is not printed over multiple lines
        INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
        NVIDIA_SMI_STATUS=$?
        if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
          echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
        elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
          # Turn off persistent mode so that the installation script can unload the kernel module
          sudo killall nvidia-persistenced || true
        else
          HAS_NVIDIA_DRIVER=1
          echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
        fi
        set -e
      fi

      if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
        # CAUTION: this may need to be updated in future
        if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
          sudo yum groupinstall -y "Development Tools"
          # ensure our kernel install is the same as our underlying kernel,
          # groupinstall "Development Tools" has a habit of mismatching kernel headers
          sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
          sudo modprobe backlight
        fi
        sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

        set +e
        sudo /bin/bash /tmp/nvidia_driver -s --no-drm
        NVIDIA_INSTALLATION_STATUS=$?

        RESET_GPU=0
        if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
          sudo cat /var/log/nvidia-installer.log
          # Failed to install NVIDIA driver, try to reset the GPU
          RESET_GPU=1
        elif [ -x "$(command -v nvidia-smi)" ]; then
          # Check again if nvidia-smi works even if the driver installation completes successfully
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            RESET_GPU=1
          fi
        fi

        if [ "$RESET_GPU" -eq 1 ]; then
          NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
          # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
          # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
          for PCI_ID in $NVIDIA_DEVICES; do
            DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
            echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
            # This requires sudo permission of course
            echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
            sleep 1
          done
        fi

        sudo rm -fv /tmp/nvidia_driver
        set -e
      fi
    )
  }

  post_install_nvidia_driver_common() {
    (
      sudo modprobe nvidia || true
      echo "After installing NVIDIA driver"
      lspci
      lsmod
      modinfo nvidia || true
      (
        set +e
        nvidia-smi
        # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
        # the case where the driver has already crashed as it still can get the driver version
        # and some basic information like the bus ID. However, the rest of the information
        # would be missing (ERR!), for example:
        #
        # +-----------------------------------------------------------------------------+
        # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 |
        # |-------------------------------+----------------------+----------------------+
        # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
        # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
        # | | | MIG M. |
        # |===============================+======================+======================|
        # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! |
        # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default |
        # | | | ERR! |
        # +-------------------------------+----------------------+----------------------+
        #
        # +-----------------------------------------------------------------------------+
        # | Processes: |
        # | GPU GI CI PID Type Process name GPU Memory |
        # | ID ID Usage |
        # |=============================================================================|
        # +-----------------------------------------------------------------------------+
        #
        # This should be reported as a failure instead as it is guaranteed to fail when
        # Docker tries to run with --gpus all
        #
        # So, the correct check here is to query one of the missing pieces of info like
        # GPU name, so that the command can fail accordingly
        nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
        NVIDIA_SMI_STATUS=$?

        # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
        if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
          echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
        else
          echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
          exit ${NVIDIA_SMI_STATUS}
        fi
        set -e
      )
    )
  }

  install_nvidia_driver_amzn2() {
    (
      set -x
      pre_install_nvidia_driver_amzn2
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  install_nvidia_driver_ubuntu20() {
    (
      set -x
      install_nvidia_driver_common
      post_install_nvidia_driver_common
    )
  }

  echo "== Installing nvidia driver ${DRIVER_FN} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_driver_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_driver_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  # Install container toolkit based on distribution
  echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
  case "${DISTRIBUTION}" in
    amzn*)
      install_nvidia_docker2_amzn2
      ;;
    ubuntu20.04)
      install_nvidia_docker2_ubuntu20
      ;;
    *)
      echo "ERROR: Unknown distribution ${DISTRIBUTION}"
      exit 1
      ;;
  esac

  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

  # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
  # more than one GPU. This just needs to be run once. The command fails
  # on subsequent runs and complains that the mode is already on, but that's
  # ok
  sudo nvidia-persistenced || true

  # This should show persistence mode ON
  nvidia-smi
2025-05-07T20:23:38.4814164Z   retry_wait_seconds: 10
2025-05-07T20:23:38.4814433Z   polling_interval_seconds: 1
2025-05-07T20:23:38.4814702Z   warning_on_retry: true
2025-05-07T20:23:38.4814960Z   continue_on_error: false
2025-05-07T20:23:38.4815207Z env:
2025-05-07T20:23:38.4815430Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.4815869Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.4816263Z   BUILD_TARGET: genai
2025-05-07T20:23:38.4816594Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.4833385Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.4833676Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.4833920Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.4834159Z ##[endgroup]
2025-05-07T20:23:38.5642850Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.5644389Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.5644788Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:38.9018056Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:38.9018963Z No packages marked for removal.
2025-05-07T20:23:38.9081705Z Dependencies resolved.
2025-05-07T20:23:38.9091414Z Nothing to do.
2025-05-07T20:23:38.9092018Z Complete!
2025-05-07T20:23:38.9929550Z + install_nvidia_driver_common
2025-05-07T20:23:38.9935783Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:38.9936425Z + lspci
2025-05-07T20:23:38.9937975Z Before installing NVIDIA driver
2025-05-07T20:23:39.0121503Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.0122374Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.0122927Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.0123447Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.0123913Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.0124531Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.0125078Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.0125549Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.0125947Z + lsmod
2025-05-07T20:23:39.0170874Z Module Size Used by
2025-05-07T20:23:39.0171254Z xt_nat 16384 0
2025-05-07T20:23:39.0171618Z nvidia_modeset 1716224 0
2025-05-07T20:23:39.0171922Z video 65536 1 nvidia_modeset
2025-05-07T20:23:39.0172236Z wmi 36864 1 video
2025-05-07T20:23:39.0172512Z nvidia_uvm 1884160 0
2025-05-07T20:23:39.0172807Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:39.0173163Z drm 602112 1 nvidia
2025-05-07T20:23:39.0173467Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:23:39.0173822Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:23:39.0174170Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:23:39.0174456Z veth 36864 0
2025-05-07T20:23:39.0174713Z xt_conntrack 16384 1
2025-05-07T20:23:39.0174967Z nft_chain_nat 16384 3
2025-05-07T20:23:39.0175225Z xt_MASQUERADE 20480 1
2025-05-07T20:23:39.0175534Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.0175871Z nf_conntrack_netlink 57344 0
2025-05-07T20:23:39.0176509Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.0176972Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:23:39.0177284Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:23:39.0177580Z xfrm_user 57344 1
2025-05-07T20:23:39.0177848Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:23:39.0178148Z xt_addrtype 16384 2
2025-05-07T20:23:39.0178402Z nft_compat 20480 4
2025-05-07T20:23:39.0178713Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.0179130Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.0179502Z br_netfilter 36864 0
2025-05-07T20:23:39.0179786Z bridge 323584 1 br_netfilter
2025-05-07T20:23:39.0180093Z stp 16384 1 bridge
2025-05-07T20:23:39.0180382Z llc 16384 2 bridge,stp
2025-05-07T20:23:39.0180675Z overlay 167936 0
2025-05-07T20:23:39.0180931Z tls 135168 0
2025-05-07T20:23:39.0181185Z nls_ascii 16384 1
2025-05-07T20:23:39.0181440Z nls_cp437 20480 1
2025-05-07T20:23:39.0181694Z vfat 24576 1
2025-05-07T20:23:39.0181946Z fat 86016 1 vfat
2025-05-07T20:23:39.0182213Z ena 180224 0
2025-05-07T20:23:39.0182465Z sunrpc 696320 1
2025-05-07T20:23:39.0182729Z ghash_clmulni_intel 16384 0
2025-05-07T20:23:39.0182993Z i8042 45056 0
2025-05-07T20:23:39.0183251Z serio 28672 3 i8042
2025-05-07T20:23:39.0183529Z button 24576 0
2025-05-07T20:23:39.0183782Z sch_fq_codel 20480 17
2025-05-07T20:23:39.0184042Z dm_mod 188416 0
2025-05-07T20:23:39.0184296Z dax 45056 1 dm_mod
2025-05-07T20:23:39.0184562Z loop 36864 0
2025-05-07T20:23:39.0184811Z fuse 163840 1
2025-05-07T20:23:39.0185156Z configfs 57344 1
2025-05-07T20:23:39.0185438Z dmi_sysfs 20480 0
2025-05-07T20:23:39.0185826Z crc32_pclmul 16384 0
2025-05-07T20:23:39.0186087Z crc32c_intel 24576 0
2025-05-07T20:23:39.0186343Z efivarfs 24576 1
2025-05-07T20:23:39.0186590Z + modinfo nvidia
2025-05-07T20:23:39.0190190Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.0190664Z import_ns: DMA_BUF
2025-05-07T20:23:39.0190918Z alias: char-major-195-*
2025-05-07T20:23:39.0191182Z version: 570.133.07
2025-05-07T20:23:39.0191429Z supported: external
2025-05-07T20:23:39.0191679Z license: Dual MIT/GPL
2025-05-07T20:23:39.0191960Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.0192325Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.0192767Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:39.0193111Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.0193459Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.0193805Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.0194123Z depends: i2c-core,drm
2025-05-07T20:23:39.0194373Z retpoline: Y
2025-05-07T20:23:39.0194594Z name: nvidia
2025-05-07T20:23:39.0194948Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.0195408Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.0195919Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.0196337Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.0196648Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:39.0196944Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.0197262Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:39.0197567Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:39.0197994Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:39.0198360Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.0198749Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.0199074Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.0199381Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:39.0199693Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.0200049Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.0200445Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.0200826Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.0201236Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0201637Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.0202055Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0202469Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.0202803Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.0203175Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.0203542Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.0203872Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.0204193Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.0204528Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.0204849Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.0205152Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:39.0205504Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.0205867Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.0206192Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:39.0206525Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.0206868Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.0207197Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:39.0207539Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.0207966Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:39.0208262Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.0208580Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.0208903Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.0209215Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.0209535Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.0209892Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.0210243Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:39.0210562Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.0210907Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.0211246Z parm: rm_firmware_active:charp
2025-05-07T20:23:39.0211544Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:39.0211782Z ++ command -v nvidia-smi
2025-05-07T20:23:39.0212046Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:39.0212306Z + set +e
2025-05-07T20:23:39.0212614Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:40.7191741Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:40.7192101Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:40.7192345Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:40.7192567Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:40.7192831Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:40.7193270Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:40.7193740Z + set -e
2025-05-07T20:23:40.7193930Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:40.7194314Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:40.7194774Z + post_install_nvidia_driver_common
2025-05-07T20:23:40.7197787Z + sudo modprobe nvidia
2025-05-07T20:23:40.8464003Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:40.8464559Z + lspci
2025-05-07T20:23:40.8464789Z After installing NVIDIA driver
2025-05-07T20:23:40.8585892Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:40.8586545Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:40.8587187Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:40.8587899Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:40.8588497Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:40.8589010Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:40.8589489Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:40.8589962Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:40.8590360Z + lsmod
2025-05-07T20:23:40.8618454Z Module Size Used by
2025-05-07T20:23:40.8618867Z xt_nat 16384 0
2025-05-07T20:23:40.8619272Z nvidia_modeset 1716224 0
2025-05-07T20:23:40.8619655Z video 65536 1 nvidia_modeset
2025-05-07T20:23:40.8620044Z wmi 36864 1 video
2025-05-07T20:23:40.8620315Z nvidia_uvm 1884160 0
2025-05-07T20:23:40.8620692Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:40.8621158Z drm 602112 1 nvidia
2025-05-07T20:23:40.8621567Z drm_panel_orientation_quirks 32768 1 drm
2025-05-07T20:23:40.8621974Z backlight 24576 3 video,drm,nvidia_modeset
2025-05-07T20:23:40.8622316Z i2c_core 110592 2 nvidia,drm
2025-05-07T20:23:40.8622601Z veth 36864 0
2025-05-07T20:23:40.8622857Z xt_conntrack 16384 1
2025-05-07T20:23:40.8623106Z nft_chain_nat 16384 3
2025-05-07T20:23:40.8623368Z xt_MASQUERADE 20480 1
2025-05-07T20:23:40.8623679Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:40.8624019Z nf_conntrack_netlink 57344 0
2025-05-07T20:23:40.8624689Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:40.8625151Z nf_defrag_ipv6 24576 1 nf_conntrack
2025-05-07T20:23:40.8625461Z nf_defrag_ipv4 16384 1 nf_conntrack
2025-05-07T20:23:40.8625747Z xfrm_user 57344 1
2025-05-07T20:23:40.8626017Z xfrm_algo 16384 1 xfrm_user
2025-05-07T20:23:40.8626302Z xt_addrtype 16384 2
2025-05-07T20:23:40.8626557Z nft_compat 20480 4
2025-05-07T20:23:40.8626858Z nf_tables 311296 57 nft_compat,nft_chain_nat
2025-05-07T20:23:40.8627264Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:40.8627632Z br_netfilter 36864 0
2025-05-07T20:23:40.8627910Z bridge 323584 1 br_netfilter
2025-05-07T20:23:40.8628203Z stp 16384 1 bridge
2025-05-07T20:23:40.8628480Z llc 16384 2 bridge,stp
2025-05-07T20:23:40.8628770Z overlay 167936 0
2025-05-07T20:23:40.8629026Z tls 135168 0
2025-05-07T20:23:40.8629275Z nls_ascii 16384 1
2025-05-07T20:23:40.8629519Z nls_cp437 20480 1
2025-05-07T20:23:40.8629764Z vfat 24576 1
2025-05-07T20:23:40.8630013Z fat 86016 1 vfat
2025-05-07T20:23:40.8630268Z ena 180224 0
2025-05-07T20:23:40.8630514Z sunrpc 696320 1
2025-05-07T20:23:40.8630768Z ghash_clmulni_intel 16384 0
2025-05-07T20:23:40.8631025Z i8042 45056 0
2025-05-07T20:23:40.8631302Z serio 28672 3 i8042
2025-05-07T20:23:40.8631598Z button 24576 0
2025-05-07T20:23:40.8631842Z sch_fq_codel 20480 17
2025-05-07T20:23:40.8632102Z dm_mod 188416 0
2025-05-07T20:23:40.8632358Z dax 45056 1 dm_mod
2025-05-07T20:23:40.8632622Z loop 36864 0
2025-05-07T20:23:40.8633009Z fuse 163840 1
2025-05-07T20:23:40.8633257Z configfs 57344 1
2025-05-07T20:23:40.8633516Z dmi_sysfs 20480 0
2025-05-07T20:23:40.8633761Z crc32_pclmul 16384 0
2025-05-07T20:23:40.8634014Z crc32c_intel 24576 0
2025-05-07T20:23:40.8634267Z efivarfs 24576 1
2025-05-07T20:23:40.8634514Z + modinfo nvidia
2025-05-07T20:23:40.8634987Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:40.8635638Z import_ns: DMA_BUF
2025-05-07T20:23:40.8636087Z alias: char-major-195-*
2025-05-07T20:23:40.8636443Z version: 570.133.07
2025-05-07T20:23:40.8636777Z supported: external
2025-05-07T20:23:40.8637096Z license: Dual MIT/GPL
2025-05-07T20:23:40.8637377Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:40.8637717Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:40.8638034Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:40.8638351Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:40.8638693Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:40.8639027Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:40.8639329Z depends: i2c-core,drm
2025-05-07T20:23:40.8639583Z retpoline: Y
2025-05-07T20:23:40.8639801Z name: nvidia
2025-05-07T20:23:40.8640157Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:40.8640766Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:40.8641384Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:40.8641862Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:40.8642163Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:40.8642462Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:40.8642778Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:40.8643073Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:40.8643382Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:40.8643862Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:40.8644249Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:40.8644597Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:40.8644899Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:40.8645204Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:40.8645555Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:40.8645950Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:40.8646325Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:40.8646732Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:40.8647128Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:40.8647551Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:40.8647959Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:40.8648296Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:40.8648661Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:40.8649033Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:40.8649367Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:40.8649691Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:40.8650022Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:40.8650343Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:40.8650645Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:40.8650992Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:40.8651397Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:40.8651721Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:40.8652052Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:40.8652395Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:40.8652820Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:40.8653164Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:40.8653491Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:40.8653770Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:40.8654094Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:40.8654414Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:40.8654726Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:40.8655048Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:40.8655403Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:40.8655755Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:40.8656071Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:40.8656418Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:40.8656756Z parm: rm_firmware_active:charp
2025-05-07T20:23:40.8657027Z + set +e
2025-05-07T20:23:40.8657225Z + nvidia-smi
2025-05-07T20:23:42.2878690Z Wed May 7 20:23:42 2025
2025-05-07T20:23:42.2879103Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.2879603Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:42.2880098Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.2880589Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.2881124Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:42.2881558Z | | | MIG M. |
2025-05-07T20:23:42.2881897Z |=========================================+========================+======================|
2025-05-07T20:23:42.2943169Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:42.2944151Z | 0% 30C P0 64W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:42.2944552Z | | | N/A |
2025-05-07T20:23:42.2944947Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.2945338Z
2025-05-07T20:23:42.2945727Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.2946154Z | Processes: |
2025-05-07T20:23:42.2946588Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:42.2946995Z | ID ID Usage |
2025-05-07T20:23:42.2947344Z |=========================================================================================|
2025-05-07T20:23:42.2947788Z | No running processes found |
2025-05-07T20:23:42.2948258Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.7120168Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.1289468Z NVIDIA A10G
2025-05-07T20:23:44.4017755Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.4018135Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.4018469Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.4018867Z + set -e
2025-05-07T20:23:44.4019085Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.4026993Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.4041033Z + sudo yum install -y yum-utils
2025-05-07T20:23:44.8366420Z Last metadata expiration check: 0:17:42 ago on Wed May 7 20:06:02 2025.
2025-05-07T20:23:44.8613286Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:44.9007123Z Dependencies resolved.
2025-05-07T20:23:44.9187369Z Nothing to do.
2025-05-07T20:23:44.9187796Z Complete!
2025-05-07T20:23:44.9575545Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:44.9576130Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:44.9576977Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.3555187Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.4123057Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:45.9934133Z nvidia-container-toolkit 13 kB/s | 833 B 00:00
2025-05-07T20:23:46.0191256Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.0592176Z Dependencies resolved.
2025-05-07T20:23:46.0769656Z ================================================================================
2025-05-07T20:23:46.0770283Z Package Arch Version Repository Size
2025-05-07T20:23:46.0770809Z ================================================================================
2025-05-07T20:23:46.0771120Z Downgrading:
2025-05-07T20:23:46.0771527Z nvidia-container-toolkit x86_64 1.16.2-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:46.0772397Z nvidia-container-toolkit-base x86_64 1.16.2-1 nvidia-container-toolkit 5.6 M
2025-05-07T20:23:46.0772784Z
2025-05-07T20:23:46.0772923Z Transaction Summary
2025-05-07T20:23:46.0773312Z ================================================================================
2025-05-07T20:23:46.0773694Z Downgrade 2 Packages
2025-05-07T20:23:46.0773843Z
2025-05-07T20:23:46.0773951Z Total download size: 6.8 M
2025-05-07T20:23:46.0774209Z Downloading Packages:
2025-05-07T20:23:46.1447463Z (1/2): nvidia-container-toolkit-base-1.16.2-1.x 85 MB/s | 5.6 MB 00:00
2025-05-07T20:23:46.1562366Z (2/2): nvidia-container-toolkit-1.16.2-1.x86_64 16 MB/s | 1.2 MB 00:00
2025-05-07T20:23:46.1570701Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.1573526Z Total 86 MB/s | 6.8 MB 00:00
2025-05-07T20:23:46.1576014Z Running transaction check
2025-05-07T20:23:46.1678201Z Transaction check succeeded.
2025-05-07T20:23:46.1678742Z Running transaction test
2025-05-07T20:23:46.1973146Z Transaction test succeeded.
2025-05-07T20:23:46.1975188Z Running transaction
2025-05-07T20:23:46.7462541Z Preparing : 1/1
2025-05-07T20:23:46.8518509Z Downgrading : nvidia-container-toolkit-base-1.16.2-1.x86_64 1/4
2025-05-07T20:23:46.8551157Z Downgrading : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:46.8797971Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:46.8798783Z Cleanup : nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:46.8905095Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 3/4
2025-05-07T20:23:46.8930912Z Cleanup : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
2025-05-07T20:23:47.0832886Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 4/4
2025-05-07T20:23:47.0833611Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 1/4
2025-05-07T20:23:47.0834145Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:47.0834729Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 3/4
2025-05-07T20:23:47.2167817Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 4/4
================================================================================
2025-05-07T20:23:47.2168724Z WARNING:
2025-05-07T20:23:47.2169015Z A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.2169275Z
2025-05-07T20:23:47.2169377Z Available Versions:
2025-05-07T20:23:47.2169524Z
2025-05-07T20:23:47.2169626Z Version 2023.7.20250331:
2025-05-07T20:23:47.2169934Z Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.2170191Z
2025-05-07T20:23:47.2170315Z dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.2170524Z
2025-05-07T20:23:47.2170621Z Release notes:
2025-05-07T20:23:47.2171030Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.2171404Z
2025-05-07T20:23:47.2171495Z Version 2023.7.20250414:
2025-05-07T20:23:47.2171808Z Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.2172056Z
2025-05-07T20:23:47.2172174Z dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.2172382Z
2025-05-07T20:23:47.2172468Z Release notes:
2025-05-07T20:23:47.2172877Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.2173243Z
2025-05-07T20:23:47.2173342Z Version 2023.7.20250428:
2025-05-07T20:23:47.2173642Z Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.2173900Z
2025-05-07T20:23:47.2174017Z dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.2174231Z
2025-05-07T20:23:47.2174321Z Release notes:
2025-05-07T20:23:47.2174713Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.2175072Z
2025-05-07T20:23:47.2175186Z ================================================================================
2025-05-07T20:23:47.2532033Z
2025-05-07T20:23:47.2532170Z
2025-05-07T20:23:47.2532608Z Downgraded:
2025-05-07T20:23:47.2532975Z nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.2533555Z nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.2533918Z
2025-05-07T20:23:47.2534011Z Complete!
2025-05-07T20:23:47.2984417Z + sudo systemctl restart docker
2025-05-07T20:23:52.4205044Z Wed May 7 20:23:52 2025
2025-05-07T20:23:52.4205453Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.4205959Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:52.4206454Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.4206954Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:52.4207496Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:52.4207935Z | | | MIG M. |
2025-05-07T20:23:52.4208280Z |=========================================+========================+======================|
2025-05-07T20:23:52.4289891Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:52.4290445Z | 0% 30C P0 64W / 300W | 0MiB / 23028MiB | 4% Default |
2025-05-07T20:23:52.4290829Z | | | N/A |
2025-05-07T20:23:52.4291227Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.4291685Z
2025-05-07T20:23:52.4292410Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.4292874Z | Processes: |
2025-05-07T20:23:52.4293320Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:52.4294008Z | ID ID Usage |
2025-05-07T20:23:52.4294357Z |=========================================================================================|
2025-05-07T20:23:52.4295145Z | No running processes found |
2025-05-07T20:23:52.5749196Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:53.5402496Z Command completed after 1 attempt(s).
2025-05-07T20:23:53.5486726Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:53.5487182Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:53.5501512Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:53.5501866Z env:
2025-05-07T20:23:53.5502095Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:53.5502395Z   BUILD_ENV: build_binary
2025-05-07T20:23:53.5502641Z   BUILD_TARGET: genai
2025-05-07T20:23:53.5502887Z   BUILD_VARIANT: cuda
2025-05-07T20:23:53.5503121Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:53.5503383Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:53.5503684Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:53.5504012Z ##[endgroup]
2025-05-07T20:23:53.8853406Z ################################################################################
2025-05-07T20:23:53.8853776Z # Print System Info
2025-05-07T20:23:53.8853996Z #
2025-05-07T20:23:53.8869825Z # [2025-05-07T20:23:53.886Z] + print_system_info
2025-05-07T20:23:53.8870185Z ################################################################################
2025-05-07T20:23:53.8870399Z
2025-05-07T20:23:53.8870510Z ################################################################################
2025-05-07T20:23:53.8870840Z [INFO] Printing environment variables ...
2025-05-07T20:23:53.8871142Z + printenv
2025-05-07T20:23:53.8871256Z
2025-05-07T20:23:53.8895466Z SHELL=/bin/bash
2025-05-07T20:23:53.8895979Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:53.8896371Z BUILD_VARIANT=cuda
2025-05-07T20:23:53.8896929Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8897495Z GITHUB_ACTION=__run
2025-05-07T20:23:53.8897774Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:53.8898108Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:53.8898356Z RUNNER_NAME=i-0bb11f79b54aad6c7
2025-05-07T20:23:53.8898882Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:53.8899473Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:53.8899992Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:53.8900707Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:53.8901546Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:53.8902088Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:53.8902656Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:53.8903461Z ***
2025-05-07T20:23:53.8903850Z LOGNAME=ec2-user
2025-05-07T20:23:53.8904305Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:53.8904809Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:53.8905262Z GITHUB_ACTIONS=true
2025-05-07T20:23:53.8905700Z SYSTEMD_EXEC_PID=55408
2025-05-07T20:23:53.8906237Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:53.8907304Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:53.8908197Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:53.8908514Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:53.8908771Z RUNNER_OS=Linux
2025-05-07T20:23:53.8908995Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:53.8909239Z HOME=/home/ec2-user
2025-05-07T20:23:53.8909499Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:53.8909790Z LANG=C.UTF-8
2025-05-07T20:23:53.8910093Z RUNNER_TRACKING_ID=github_0bce55bd-12c2-4dec-a701-d9bdbd3e25ae
2025-05-07T20:23:53.8910453Z RUNNER_ARCH=X64
2025-05-07T20:23:53.8910736Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:53.8911345Z BUILD_TARGET=genai
2025-05-07T20:23:53.8911859Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8912716Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8913441Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:53.8914098Z INVOCATION_ID=fbf0150337c146ec88e18a11b2fcdd98
2025-05-07T20:23:53.8914417Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:53.8914684Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:53.8915254Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_2d420f38-63a6-49a3-894a-78ca8f969b19
2025-05-07T20:23:53.8915929Z BUILD_ENV=build_binary
2025-05-07T20:23:53.8916166Z GITHUB_ACTOR=q10
2025-05-07T20:23:53.8916383Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:53.8916616Z KERN_NAME_LC=linux
2025-05-07T20:23:53.8916841Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:53.8917140Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:53.8917469Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:53.8917716Z USER=ec2-user
2025-05-07T20:23:53.8917948Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:53.8918219Z SHLVL=1 2025-05-07T20:23:53.8918415Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:53.8918723Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:53.8919167Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:53.8919528Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:53.8919766Z KERN_NAME=Linux 2025-05-07T20:23:53.8919995Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:53.8920391Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:53.8920818Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:53.8921093Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:53.8921329Z JOURNAL_STREAM=8:93345 2025-05-07T20:23:53.8921641Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:53.8922005Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:53.8922306Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:53.8922634Z GITHUB_BASE_REF=main 2025-05-07T20:23:53.8922855Z CI=true 2025-05-07T20:23:53.8923062Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:53.8923348Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:53.8923628Z GITHUB_ACTION_REF= 2025-05-07T20:23:53.8923883Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:53.8924477Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_2d420f38-63a6-49a3-894a-78ca8f969b19 2025-05-07T20:23:53.8925056Z MACHINE_NAME=x86_64 2025-05-07T20:23:53.8925281Z _=/usr/bin/printenv 2025-05-07T20:23:53.8925426Z 2025-05-07T20:23:53.8925540Z ################################################################################ 2025-05-07T20:23:53.8925862Z [INFO] Print ldd version ... 2025-05-07T20:23:53.8926129Z + ldd --version 2025-05-07T20:23:53.8926258Z 2025-05-07T20:23:53.8926349Z ldd (GNU libc) 2.34 2025-05-07T20:23:53.8926609Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:53.8927047Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:53.8927577Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:53.8928013Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:53.8928238Z 2025-05-07T20:23:53.8928351Z ################################################################################ 2025-05-07T20:23:53.8928686Z [INFO] Print CPU info ... 
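The nproc value printed below (16 on this g5.4xlarge runner) is the figure CI scripts typically feed into parallel build steps. A hypothetical sketch of that pattern; MAX_JOBS and the make invocation are illustrative and not taken from this workflow:

    MAX_JOBS="$(nproc)"           # 16 on this runner
    if [ "$MAX_JOBS" -gt 4 ]; then
      MAX_JOBS=$((MAX_JOBS - 2))  # assumption: leave headroom for the system
    fi
    make -j "$MAX_JOBS"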
2025-05-07T20:23:53.8928953Z + nproc 2025-05-07T20:23:53.8929060Z 2025-05-07T20:23:53.8945142Z 16 2025-05-07T20:23:53.8946647Z 2025-05-07T20:23:53.8946887Z + lscpu 2025-05-07T20:23:53.8946999Z 2025-05-07T20:23:53.9065052Z Architecture: x86_64 2025-05-07T20:23:53.9066184Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:53.9067408Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.9068168Z Byte Order: Little Endian 2025-05-07T20:23:53.9068483Z CPU(s): 16 2025-05-07T20:23:53.9068778Z On-line CPU(s) list: 0-15 2025-05-07T20:23:53.9069088Z Vendor ID: AuthenticAMD 2025-05-07T20:23:53.9069424Z Model name: AMD EPYC 7R32 2025-05-07T20:23:53.9069741Z CPU family: 23 2025-05-07T20:23:53.9070164Z Model: 49 2025-05-07T20:23:53.9070453Z Thread(s) per core: 2 2025-05-07T20:23:53.9070741Z Core(s) per socket: 8 2025-05-07T20:23:53.9071017Z Socket(s): 1 2025-05-07T20:23:53.9071295Z Stepping: 0 2025-05-07T20:23:53.9071594Z BogoMIPS: 5599.99 2025-05-07T20:23:53.9073686Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.9075864Z Hypervisor vendor: KVM 2025-05-07T20:23:53.9076175Z Virtualization type: full 2025-05-07T20:23:53.9076504Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:53.9076874Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:53.9077236Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:53.9077588Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:53.9077911Z NUMA node(s): 1 2025-05-07T20:23:53.9078203Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:53.9078532Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:53.9078947Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:53.9079301Z Vulnerability L1tf: Not affected 2025-05-07T20:23:53.9079654Z Vulnerability Mds: Not affected 2025-05-07T20:23:53.9080014Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:53.9080369Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:53.9080737Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:53.9081273Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:53.9081874Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:53.9082423Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:53.9083109Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:53.9083950Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:53.9084623Z Vulnerability Srbds: Not affected 2025-05-07T20:23:53.9084989Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:53.9085218Z 2025-05-07T20:23:53.9085410Z + cat /proc/cpuinfo 2025-05-07T20:23:53.9085547Z 2025-05-07T20:23:53.9085632Z processor : 0 2025-05-07T20:23:53.9085850Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.9086089Z cpu family : 23 2025-05-07T20:23:53.9086293Z model : 49 
2025-05-07T20:23:53.9086501Z model name : AMD EPYC 7R32
2025-05-07T20:23:53.9086749Z stepping : 0
2025-05-07T20:23:53.9086953Z microcode : 0x830107f
2025-05-07T20:23:53.9087283Z cpu MHz : 2530.842
2025-05-07T20:23:53.9087496Z cache size : 512 KB
2025-05-07T20:23:53.9087709Z physical id : 0
2025-05-07T20:23:53.9087920Z siblings : 16
2025-05-07T20:23:53.9088119Z core id : 0
2025-05-07T20:23:53.9088310Z cpu cores : 8
2025-05-07T20:23:53.9088512Z apicid : 0
2025-05-07T20:23:53.9088711Z initial apicid : 0
2025-05-07T20:23:53.9088917Z fpu : yes
2025-05-07T20:23:53.9089117Z fpu_exception : yes
2025-05-07T20:23:53.9089332Z cpuid level : 13
2025-05-07T20:23:53.9089532Z wp : yes
2025-05-07T20:23:53.9091608Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:53.9093870Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:53.9094359Z bogomips : 5599.99
2025-05-07T20:23:53.9094578Z TLB size : 3072 4K pages
2025-05-07T20:23:53.9094810Z clflush size : 64
2025-05-07T20:23:53.9095028Z cache_alignment : 64
2025-05-07T20:23:53.9095296Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:53.9095616Z power management:
[processors 1-15 omitted: identical to processor 0 except core id, apicid, and per-core cpu MHz (ranging from ~1797 to ~3319 MHz)]
2025-05-07T20:23:53.9303221Z 
2025-05-07T20:23:53.9303225Z 
2025-05-07T20:23:53.9303353Z ################################################################################
2025-05-07T20:23:53.9303664Z [INFO] Print PCI info ...
2025-05-07T20:23:53.9303909Z + lspci -v
2025-05-07T20:23:53.9304023Z 
2025-05-07T20:23:53.9304240Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:53.9304625Z Subsystem: Amazon.com, Inc.
Device 1237 2025-05-07T20:23:53.9304945Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:53.9305156Z 2025-05-07T20:23:53.9305357Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:53.9305736Z Physical Slot: 1 2025-05-07T20:23:53.9305976Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9306175Z 2025-05-07T20:23:53.9306416Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:53.9306848Z Physical Slot: 1 2025-05-07T20:23:53.9307101Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:53.9307321Z 2025-05-07T20:23:53.9307587Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:53.9308021Z Physical Slot: 3 2025-05-07T20:23:53.9308256Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9308605Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.9308953Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:53.9309178Z 2025-05-07T20:23:53.9309475Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.9310095Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:53.9310383Z Physical Slot: 4 2025-05-07T20:23:53.9310630Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:53.9311010Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.9311372Z Capabilities: 2025-05-07T20:23:53.9311632Z Kernel driver in use: nvme 2025-05-07T20:23:53.9311797Z 2025-05-07T20:23:53.9312097Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.9312572Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.9312907Z Physical Slot: 5 2025-05-07T20:23:53.9313146Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9313497Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.9313876Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.9314193Z Capabilities: 2025-05-07T20:23:53.9314460Z Kernel driver in use: ena 2025-05-07T20:23:53.9314698Z Kernel modules: ena 2025-05-07T20:23:53.9314837Z 2025-05-07T20:23:53.9315001Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:53.9315380Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:53.9315722Z Physical Slot: 30 2025-05-07T20:23:53.9315978Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:53.9316352Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:53.9316746Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:53.9317111Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:53.9317438Z Capabilities: 2025-05-07T20:23:53.9317704Z Kernel driver in use: nvidia 2025-05-07T20:23:53.9317962Z Kernel modules: nvidia 2025-05-07T20:23:53.9318108Z 2025-05-07T20:23:53.9318405Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.9318915Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:53.9319203Z Physical Slot: 31 2025-05-07T20:23:53.9319440Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.9319792Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.9320173Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:53.9320493Z Capabilities: 2025-05-07T20:23:53.9320760Z Kernel driver in use: nvme 2025-05-07T20:23:53.9320921Z 2025-05-07T20:23:53.9320925Z 2025-05-07T20:23:53.9321043Z ################################################################################ 2025-05-07T20:23:53.9321377Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:53.9321652Z + uname -a 2025-05-07T20:23:53.9321769Z 2025-05-07T20:23:53.9322172Z Linux ip-10-0-16-208.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:53.9322675Z 2025-05-07T20:23:53.9322767Z + uname -m 2025-05-07T20:23:53.9322882Z 2025-05-07T20:23:53.9322961Z x86_64 2025-05-07T20:23:53.9323069Z 2025-05-07T20:23:53.9323154Z + cat /proc/version 2025-05-07T20:23:53.9323289Z 2025-05-07T20:23:53.9323826Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:53.9324448Z 2025-05-07T20:23:53.9324535Z + cat /etc/os-release 2025-05-07T20:23:53.9324675Z 2025-05-07T20:23:53.9324768Z NAME="Amazon Linux" 2025-05-07T20:23:53.9324981Z VERSION="2023" 2025-05-07T20:23:53.9325182Z ID="amzn" 2025-05-07T20:23:53.9325368Z ID_LIKE="fedora" 2025-05-07T20:23:53.9325569Z VERSION_ID="2023" 2025-05-07T20:23:53.9325803Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:53.9326079Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:53.9326362Z ANSI_COLOR="0;33" 2025-05-07T20:23:53.9326606Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:53.9327082Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:53.9327502Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:53.9327909Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:53.9328346Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:53.9328712Z VENDOR_NAME="AWS" 2025-05-07T20:23:53.9328941Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:53.9329225Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:53.9329377Z 2025-05-07T20:23:53.9329578Z ################################################################################ 2025-05-07T20:23:53.9329878Z # Print EC2 Instance Info 2025-05-07T20:23:53.9330111Z # 2025-05-07T20:23:53.9330319Z # [2025-05-07T20:23:53.928Z] + print_ec2_info 2025-05-07T20:23:53.9330626Z ################################################################################ 2025-05-07T20:23:53.9330839Z 2025-05-07T20:23:53.9408404Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:53.9525244Z instance-id: i-0bb11f79b54aad6c7 2025-05-07T20:23:53.9635432Z instance-type: g5.4xlarge 2025-05-07T20:23:53.9679735Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:53.9680117Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:53.9689803Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:53.9690165Z env: 2025-05-07T20:23:53.9690395Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:53.9690711Z BUILD_ENV: build_binary 2025-05-07T20:23:53.9690969Z BUILD_TARGET: genai 2025-05-07T20:23:53.9691212Z BUILD_VARIANT: cuda 2025-05-07T20:23:53.9691453Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:53.9691725Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:53.9692045Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.9692384Z ##[endgroup] 2025-05-07T20:23:54.3023105Z ################################################################################ 2025-05-07T20:23:54.3023506Z [INFO] Printing general display info ... 2025-05-07T20:23:54.3055235Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:54.4147227Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:54.4156103Z /usr/bin/sudo 2025-05-07T20:23:54.4167150Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:54.4178876Z /usr/bin/yum 2025-05-07T20:23:54.4180789Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:54.4202168Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:54.8444307Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:45 2025. 2025-05-07T20:23:54.9197583Z ================================================================================ 2025-05-07T20:23:54.9198040Z WARNING: 2025-05-07T20:23:54.9198418Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:54.9198739Z 2025-05-07T20:23:54.9198866Z Available Versions: 2025-05-07T20:23:54.9199061Z 2025-05-07T20:23:54.9199194Z Version 2023.7.20250331: 2025-05-07T20:23:54.9199529Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.9199810Z 2025-05-07T20:23:54.9199943Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.9200157Z 2025-05-07T20:23:54.9200246Z Release notes: 2025-05-07T20:23:54.9200647Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.9201008Z 2025-05-07T20:23:54.9201097Z Version 2023.7.20250414: 2025-05-07T20:23:54.9201401Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.9201645Z 2025-05-07T20:23:54.9201762Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.9201968Z 2025-05-07T20:23:54.9202058Z Release notes: 2025-05-07T20:23:54.9202439Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.9202804Z 2025-05-07T20:23:54.9202894Z Version 2023.7.20250428: 2025-05-07T20:23:54.9203193Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.9203670Z 2025-05-07T20:23:54.9203794Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.9204010Z 2025-05-07T20:23:54.9204099Z Release notes: 2025-05-07T20:23:54.9204484Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.9204839Z 2025-05-07T20:23:54.9204956Z ================================================================================ 2025-05-07T20:23:55.0340231Z Dependencies resolved. 
2025-05-07T20:23:55.0624777Z ================================================================================
2025-05-07T20:23:55.0625342Z Package Arch Version Repository Size
2025-05-07T20:23:55.0625845Z ================================================================================
2025-05-07T20:23:55.0626145Z Upgrading:
2025-05-07T20:23:55.0626502Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M
2025-05-07T20:23:55.0627088Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M
2025-05-07T20:23:55.0627512Z 
2025-05-07T20:23:55.0627846Z Transaction Summary
2025-05-07T20:23:55.0628204Z ================================================================================
2025-05-07T20:23:55.0628640Z Upgrade 2 Packages
2025-05-07T20:23:55.0628827Z 
2025-05-07T20:23:55.0628976Z Total download size: 6.9 M
2025-05-07T20:23:55.0629317Z Downloading Packages:
2025-05-07T20:23:55.1001347Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 34 MB/s | 1.2 MB 00:00
2025-05-07T20:23:55.1516452Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 65 MB/s | 5.7 MB 00:00
2025-05-07T20:23:55.1524727Z --------------------------------------------------------------------------------
2025-05-07T20:23:55.1527721Z Total 77 MB/s | 6.9 MB 00:00
2025-05-07T20:23:55.1530240Z Running transaction check
2025-05-07T20:23:55.1625736Z Transaction check succeeded.
2025-05-07T20:23:55.1626148Z Running transaction test
2025-05-07T20:23:55.1920532Z Transaction test succeeded.
2025-05-07T20:23:55.1923531Z Running transaction
2025-05-07T20:23:55.7435049Z Preparing : 1/1
2025-05-07T20:23:55.8494431Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4
2025-05-07T20:23:55.8520317Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:55.8741023Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4
2025-05-07T20:23:55.8741782Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:55.8853651Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4
2025-05-07T20:23:55.8883777Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:56.0477190Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4
2025-05-07T20:23:56.0477972Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4
2025-05-07T20:23:56.0478646Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4
2025-05-07T20:23:56.0479185Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:56.2516503Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4
2025-05-07T20:23:56.2516854Z 
2025-05-07T20:23:56.2516947Z Upgraded:
2025-05-07T20:23:56.2517289Z nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:56.2517855Z nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:56.2518198Z 
2025-05-07T20:23:56.2518281Z Complete!
2025-05-07T20:23:56.2958796Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:56.2983788Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:56.7477116Z Last metadata expiration check: 0:00:11 ago on Wed May 7 20:23:45 2025.
2025-05-07T20:23:56.7719035Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:56.8123720Z Dependencies resolved.
2025-05-07T20:23:56.8302426Z ================================================================================
2025-05-07T20:23:56.8302931Z Package Architecture Version Repository Size
2025-05-07T20:23:56.8303426Z ================================================================================
2025-05-07T20:23:56.8303756Z Installing:
2025-05-07T20:23:56.8304050Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k
2025-05-07T20:23:56.8304322Z 
2025-05-07T20:23:56.8304419Z Transaction Summary
2025-05-07T20:23:56.8304664Z ================================================================================
2025-05-07T20:23:56.8304968Z Install 1 Package
2025-05-07T20:23:56.8305104Z 
2025-05-07T20:23:56.8305234Z Total download size: 319 k
2025-05-07T20:23:56.8306112Z Installed size: 837 k
2025-05-07T20:23:56.8307762Z Downloading Packages:
2025-05-07T20:23:56.9120120Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.0 MB/s | 319 kB 00:00
2025-05-07T20:23:56.9126365Z --------------------------------------------------------------------------------
2025-05-07T20:23:56.9129567Z Total 3.8 MB/s | 319 kB 00:00
2025-05-07T20:23:56.9283874Z Running transaction check
2025-05-07T20:23:56.9338580Z Transaction check succeeded.
2025-05-07T20:23:56.9339023Z Running transaction test
2025-05-07T20:23:56.9791919Z Transaction test succeeded.
2025-05-07T20:23:56.9795998Z Running transaction
2025-05-07T20:23:57.0792731Z Preparing : 1/1
2025-05-07T20:23:57.1277137Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:57.3055995Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:57.4574396Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:57.4574860Z 
2025-05-07T20:23:57.4574992Z Installed:
2025-05-07T20:23:57.4575422Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64
2025-05-07T20:23:57.4575830Z 
2025-05-07T20:23:57.4575964Z Complete!
2025-05-07T20:23:57.5032982Z + hostname
2025-05-07T20:23:57.5033148Z 
2025-05-07T20:23:57.5047487Z ip-10-0-16-208.ec2.internal
2025-05-07T20:23:57.5048843Z 
2025-05-07T20:23:57.5049381Z + sudo lshw -C display
2025-05-07T20:23:57.5049599Z 
2025-05-07T20:23:58.0775715Z *-display:0 UNCLAIMED
2025-05-07T20:23:58.0776118Z description: VGA compatible controller
2025-05-07T20:23:58.0776445Z product: Amazon.com, Inc.
2025-05-07T20:23:58.0776727Z vendor: Amazon.com, Inc.
2025-05-07T20:23:58.0776987Z physical id: 3 2025-05-07T20:23:58.0777226Z bus info: pci@0000:00:03.0 2025-05-07T20:23:58.0777479Z version: 00 2025-05-07T20:23:58.0777695Z width: 32 bits 2025-05-07T20:23:58.0777916Z clock: 33MHz 2025-05-07T20:23:58.0778159Z capabilities: vga_controller bus_master 2025-05-07T20:23:58.0778496Z configuration: latency=0 2025-05-07T20:23:58.0785978Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:58.0786331Z *-display:1 2025-05-07T20:23:58.0786583Z description: 3D controller 2025-05-07T20:23:58.0786877Z product: GA102GL [A10G] 2025-05-07T20:23:58.0787153Z vendor: NVIDIA Corporation 2025-05-07T20:23:58.0787421Z physical id: 1e 2025-05-07T20:23:58.0787667Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:58.0787932Z version: a1 2025-05-07T20:23:58.0788149Z width: 64 bits 2025-05-07T20:23:58.0788377Z clock: 33MHz 2025-05-07T20:23:58.0788676Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:58.0789051Z configuration: driver=nvidia latency=0 2025-05-07T20:23:58.0789672Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:58.0816111Z 2025-05-07T20:23:58.0816308Z ################################################################################ 2025-05-07T20:23:58.0816728Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:58.0950348Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:58.1120166Z Wed May 7 20:23:58 2025 2025-05-07T20:23:58.1120705Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:58.1121384Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:58.1121895Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:58.1122395Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:58.1122924Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:58.1123356Z | | | MIG M. | 2025-05-07T20:23:58.1123692Z |=========================================+========================+======================| 2025-05-07T20:23:58.1201478Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:58.1202341Z | 0% 30C P0 60W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:58.1202881Z | | | N/A | 2025-05-07T20:23:58.1203388Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:58.1203791Z 2025-05-07T20:23:58.1204180Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:58.1204603Z | Processes: | 2025-05-07T20:23:58.1205040Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:58.1205454Z | ID ID Usage | 2025-05-07T20:23:58.1205830Z |=========================================================================================| 2025-05-07T20:23:58.1206415Z | No running processes found | 2025-05-07T20:23:58.1207054Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:58.2594308Z ################################################################################ 2025-05-07T20:23:58.2594792Z [INFO] Printing AMD GPU info ... 
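The AMD GPU check that follows probes for ROCm tooling and merely reports when it is absent, as the "[CHECK] ... not found" lines show; on this CUDA runner neither tool exists. A minimal sketch of that probe pattern, assuming a plain command -v test (the real check in setup_env.bash may differ):

    for tool in rocminfo rocm-smi; do
      if command -v "$tool" >/dev/null 2>&1; then
        "$tool"                         # print ROCm GPU info when available
      else
        echo "[CHECK] $tool not found"  # degrade gracefully on CUDA runners
      fi
    done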
2025-05-07T20:23:58.2734229Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:58.2735191Z [CHECK] rocminfo not found 2025-05-07T20:23:58.2744077Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:58.2745210Z [CHECK] rocm-smi not found 2025-05-07T20:23:58.2780092Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:58.2780526Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:58.2793004Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:58.2793363Z env: 2025-05-07T20:23:58.2793593Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:58.2793907Z BUILD_ENV: build_binary 2025-05-07T20:23:58.2794164Z BUILD_TARGET: genai 2025-05-07T20:23:58.2794394Z BUILD_VARIANT: cuda 2025-05-07T20:23:58.2794645Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:58.2794910Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:58.2795218Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:58.2795558Z ##[endgroup] 2025-05-07T20:23:58.6133580Z ################################################################################ 2025-05-07T20:23:58.6133938Z # Setup Miniconda 2025-05-07T20:23:58.6134314Z # 2025-05-07T20:23:58.6148358Z # [2025-05-07T20:23:58.614Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:58.6148864Z ################################################################################ 2025-05-07T20:23:58.6149093Z 2025-05-07T20:23:58.6163208Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:58.7045094Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:58.7045462Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:58.7045660Z 2025-05-07T20:23:58.7062741Z 2025-05-07T20:23:58.7063081Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:58.7086238Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:59.6646494Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:59.6646876Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:59.6647131Z 2025-05-07T20:23:59.6789655Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:24:00.1189215Z Unpacking payload ... 2025-05-07T20:24:00.6380150Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:01.4470140Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:03.5509097Z 2025-05-07T20:24:03.5509447Z Installing base environment... 2025-05-07T20:24:03.5509678Z 2025-05-07T20:24:04.6384500Z Preparing transaction: ...working... done 2025-05-07T20:24:07.5491072Z Executing transaction: ...working... done 2025-05-07T20:24:08.2216355Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:08.3124739Z installation finished. 2025-05-07T20:24:08.3133799Z 2025-05-07T20:24:08.3133996Z + rm -f miniconda.sh 2025-05-07T20:24:08.3134186Z 2025-05-07T20:24:08.4006546Z 2025-05-07T20:24:08.4006946Z [SETUP] Reloading the bash configuration ... 
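Reloading the bash configuration is what makes the `conda` command usable later in this same non-interactive job step. The two commands the log runs next amount to the following, with paths matching the install prefix chosen above:

    # Register conda's shell hooks in ~/.bashrc, then re-source the rc file so
    # the current bash process picks them up without opening a new shell.
    /home/ec2-user/miniconda/bin/conda init bash
    . /home/ec2-user/.bashrc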
2025-05-07T20:24:08.4007314Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:08.7666875Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:08.7667255Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:08.7667648Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:08.7668157Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:08.7668522Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:08.7668913Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:08.7669342Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:08.7669780Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:08.7670235Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:08.7671009Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:08.7671531Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:08.7671911Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:08.7672299Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:08.8318696Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:09.6748815Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:09.6770599Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:23.0756111Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:24.6302584Z Solving environment: done
2025-05-07T20:24:24.7269807Z ## Package Plan ##
2025-05-07T20:24:24.7270559Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:24.7271223Z added / updated specs:
2025-05-07T20:24:24.7271741Z   - conda-libmamba-solver
2025-05-07T20:24:24.7272240Z   - libarchive
2025-05-07T20:24:24.7272640Z   - libmamba
2025-05-07T20:24:24.7273038Z   - libmambapy
2025-05-07T20:24:24.7273577Z The following packages will be downloaded:
2025-05-07T20:24:24.7274236Z package | build
2025-05-07T20:24:24.7274858Z ---------------------------|-----------------
2025-05-07T20:24:24.7275498Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:24.7276092Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:24.7276514Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:24.7276982Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:24.7277426Z ------------------------------------------------------------
2025-05-07T20:24:24.7277764Z Total: 1.4 MB
2025-05-07T20:24:24.7278082Z The following packages will be UPDATED:
2025-05-07T20:24:24.7283189Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:24.7283965Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:24.7284568Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:24.7285193Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:24.7285983Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:24.7286614Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:24.8043678Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:24.8142749Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:24.8229736Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:24.8518077Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:24.9512714Z done
2025-05-07T20:24:25.0518309Z Preparing transaction: done
2025-05-07T20:24:25.1524202Z Verifying transaction: done
2025-05-07T20:24:26.4542453Z Executing transaction: done
2025-05-07T20:24:28.1789617Z [SETUP] Updating Miniconda base packages ...
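The [EXEC] [ATTEMPT 0/3] prefix on each command indicates the script runs everything through a retry wrapper. A hedged sketch of such a wrapper; the helper's real name, attempt count, and backoff policy are not shown in this log, so everything below is an assumption:

    # Hypothetical retry helper: try a command up to 4 times (attempts 0..3).
    exec_with_retries () {
      local attempt
      for attempt in 0 1 2 3; do
        echo "[EXEC] [ATTEMPT ${attempt}/3] + $*"
        "$@" && return 0
        sleep 2   # assumed fixed delay; the actual policy is not logged
      done
      return 1
    }

    # Usage matching the step below:
    exec_with_retries conda update -n base -c defaults --update-deps -y conda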
2025-05-07T20:24:28.1815095Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:29.1425303Z Channels:
2025-05-07T20:24:29.1425738Z - defaults
2025-05-07T20:24:29.1426152Z Platform: linux-64
2025-05-07T20:24:30.3118224Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.4283935Z Solving environment: Channels:
2025-05-07T20:24:30.4284503Z - defaults
2025-05-07T20:24:30.4284915Z Platform: linux-64
2025-05-07T20:24:30.7223840Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.9328940Z Solving environment: done
2025-05-07T20:24:31.0147664Z done
2025-05-07T20:24:31.0801022Z ## Package Plan ##
2025-05-07T20:24:31.0801337Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:31.0801686Z added / updated specs:
2025-05-07T20:24:31.0801932Z   - conda
2025-05-07T20:24:31.0802176Z The following packages will be downloaded:
2025-05-07T20:24:31.0802508Z package | build
2025-05-07T20:24:31.0802823Z ---------------------------|-----------------
2025-05-07T20:24:31.0803169Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:31.0803800Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:31.0804177Z ------------------------------------------------------------
2025-05-07T20:24:31.0804525Z Total: 1.4 MB
2025-05-07T20:24:31.0804853Z The following packages will be UPDATED:
2025-05-07T20:24:31.0805360Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:31.0805861Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:31.0806262Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:31.1521348Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:31.3778749Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:31.3931865Z done
2025-05-07T20:24:31.4934907Z Preparing transaction: done
2025-05-07T20:24:31.5940359Z Verifying transaction: done
2025-05-07T20:24:33.5966999Z Executing transaction: done
2025-05-07T20:24:34.2026511Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:34.2030665Z + conda clean --packages --tarball -y
2025-05-07T20:24:35.2046703Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:35.2047181Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:35.2673441Z + conda clean --all -y
2025-05-07T20:24:35.8174798Z There are no unused tarball(s) to remove.
2025-05-07T20:24:35.8175160Z Will remove 1 index cache(s).
2025-05-07T20:24:35.8175444Z There are no unused package(s) to remove.
2025-05-07T20:24:35.8175765Z There are no tempfile(s) to remove. 2025-05-07T20:24:35.8176074Z There are no logfile(s) to remove. 2025-05-07T20:24:35.8818624Z 2025-05-07T20:24:35.8822322Z + conda info 2025-05-07T20:24:35.8822480Z 2025-05-07T20:24:36.6560339Z 2025-05-07T20:24:36.6560936Z active environment : base 2025-05-07T20:24:36.6561409Z active env location : /home/ec2-user/miniconda 2025-05-07T20:24:36.6561735Z shell level : 1 2025-05-07T20:24:36.6562013Z user config file : /home/ec2-user/.condarc 2025-05-07T20:24:36.6562400Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:24:36.6562767Z conda version : 25.3.1 2025-05-07T20:24:36.6563043Z conda-build version : not installed 2025-05-07T20:24:36.6563339Z python version : 3.13.2.final.0 2025-05-07T20:24:36.6563638Z solver : libmamba (default) 2025-05-07T20:24:36.6563946Z virtual packages : __archspec=1=zen2 2025-05-07T20:24:36.6564236Z __conda=25.3.1=0 2025-05-07T20:24:36.6564517Z __cuda=12.8=0 2025-05-07T20:24:36.6564799Z __glibc=2.34=0 2025-05-07T20:24:36.6565077Z __linux=6.1.130=0 2025-05-07T20:24:36.6565630Z __unix=0=0 2025-05-07T20:24:36.6566360Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:24:36.6566772Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:24:36.6567114Z conda av metadata url : None 2025-05-07T20:24:36.6567488Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:24:36.6567916Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:24:36.6568293Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:24:36.6568669Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:24:36.6569038Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:24:36.6569379Z /home/ec2-user/.conda/pkgs 2025-05-07T20:24:36.6569714Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:24:36.6570054Z /home/ec2-user/.conda/envs 2025-05-07T20:24:36.6570358Z platform : linux-64 2025-05-07T20:24:36.6571178Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:24:36.6572150Z UID:GID : 1000:1000 2025-05-07T20:24:36.6572427Z netrc file : None 2025-05-07T20:24:36.6572687Z offline mode : False 2025-05-07T20:24:36.6572852Z 2025-05-07T20:24:36.7212038Z 2025-05-07T20:24:36.7212291Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:24:36.7213141Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_ce4d2be4-91d5-4eea-8431-0f6f6f174062 ... 2025-05-07T20:24:36.7214234Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:24:36.7295508Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:24:36.7295996Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.12 2025-05-07T20:24:36.7312611Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:36.7312962Z env: 2025-05-07T20:24:36.7313195Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:36.7313493Z BUILD_ENV: build_binary 2025-05-07T20:24:36.7313751Z BUILD_TARGET: genai 2025-05-07T20:24:36.7313979Z BUILD_VARIANT: cuda 2025-05-07T20:24:36.7314208Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:24:36.7314465Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:36.7314763Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:36.7315088Z ##[endgroup] 2025-05-07T20:24:37.0664985Z ################################################################################ 2025-05-07T20:24:37.0665642Z # Create Conda Environment 2025-05-07T20:24:37.0665910Z # 2025-05-07T20:24:37.0682319Z # [2025-05-07T20:24:37.067Z] + create_conda_environment build_binary 3.12 2025-05-07T20:24:37.0682736Z ################################################################################ 2025-05-07T20:24:37.0682952Z 2025-05-07T20:24:37.0697584Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:37.1646005Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:37.1646378Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:24:37.1646721Z + conda info --envs 2025-05-07T20:24:37.1646859Z 2025-05-07T20:24:37.9313476Z 2025-05-07T20:24:37.9313839Z # conda environments: 2025-05-07T20:24:37.9314111Z # 2025-05-07T20:24:37.9314336Z base /home/ec2-user/miniconda 2025-05-07T20:24:37.9314569Z 2025-05-07T20:24:37.9987906Z 2025-05-07T20:24:37.9988512Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:24:39.6381281Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:39.6381572Z 2025-05-07T20:24:39.6397267Z 2025-05-07T20:24:39.6406508Z [SETUP] Creating new Conda environment (Python 3.12) ... 
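Deleting the prefix directory before creating the environment guarantees a clean slate on a reused self-hosted runner; a stale env from a previous job would otherwise leak packages into this build. Reproduced standalone, the step below boils down to the following (paths as shown in the log):

    # Wipe any leftover environment, then create a fresh Python 3.12 env.
    rm -rf "$HOME/miniconda/envs/build_binary"
    conda create -y -n build_binary python=3.12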
2025-05-07T20:24:39.6428507Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:24:40.4276251Z Channels:
2025-05-07T20:24:40.4276516Z - defaults
2025-05-07T20:24:40.4276727Z Platform: linux-64
2025-05-07T20:24:41.9311510Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.0554495Z Solving environment: done
2025-05-07T20:24:42.0843952Z ## Package Plan ##
2025-05-07T20:24:42.0844668Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.0845726Z added / updated specs:
2025-05-07T20:24:42.0846293Z   - python=3.12
2025-05-07T20:24:42.0846826Z The following packages will be downloaded:
2025-05-07T20:24:42.0847517Z package | build
2025-05-07T20:24:42.0848168Z ---------------------------|-----------------
2025-05-07T20:24:42.0848883Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:42.0849678Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:42.0850512Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:42.0851313Z python-3.12.9 | h5148396_0 34.7 MB
2025-05-07T20:24:42.0852432Z setuptools-78.1.1 | py312h06a4308_0 2.2 MB
2025-05-07T20:24:42.0853216Z wheel-0.45.1 | py312h06a4308_0 147 KB
2025-05-07T20:24:42.0853888Z ------------------------------------------------------------
2025-05-07T20:24:42.0854274Z Total: 37.2 MB
2025-05-07T20:24:42.0854622Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.0855237Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:42.0855690Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:42.0856110Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:42.0856595Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:42.0857085Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:42.0857540Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:42.0858007Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:42.0858439Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.0858878Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:42.0859334Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.0859800Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:42.0860225Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:42.0860651Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:42.0861058Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:42.0861464Z python pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:24:42.0861897Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:42.0862372Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:24:42.0862845Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:42.0863240Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:42.0863630Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:42.0864048Z wheel pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:24:42.0864457Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:42.0864840Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:42.0865244Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:42.1250604Z wheel-0.45.1 | 147 KB | ########## | 100%
2025-05-07T20:24:42.1452549Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:42.1792938Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:42.2079397Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:42.2831366Z setuptools-78.1.1 | 2.2 MB | ########## | 100%
2025-05-07T20:24:42.6467921Z python-3.12.9 | 34.7 MB | ########## | 100%
2025-05-07T20:24:43.2308263Z done
2025-05-07T20:24:43.4415474Z Preparing transaction: done
2025-05-07T20:24:44.8785185Z Verifying transaction: done
2025-05-07T20:24:47.2994133Z Executing transaction: done
2025-05-07T20:24:47.3498448Z #
2025-05-07T20:24:47.3498782Z # To activate this environment, use
2025-05-07T20:24:47.3499182Z #
2025-05-07T20:24:47.3499458Z # $ conda activate build_binary
2025-05-07T20:24:47.3499741Z #
2025-05-07T20:24:47.3499959Z # To deactivate an active environment, use
2025-05-07T20:24:47.3500245Z #
2025-05-07T20:24:47.3500431Z # $ conda deactivate
2025-05-07T20:24:47.4579173Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:47.4603925Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:50.4370511Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:50.4371128Z Collecting pip
2025-05-07T20:24:50.4371440Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:50.4371867Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:50.4375748Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 64.5 MB/s eta 0:00:00
2025-05-07T20:24:50.4376155Z Installing collected packages: pip
2025-05-07T20:24:50.4376450Z Attempting uninstall: pip
2025-05-07T20:24:50.4376733Z Found existing installation: pip 25.1
2025-05-07T20:24:50.4377050Z Uninstalling pip-25.1:
2025-05-07T20:24:50.4377319Z Successfully uninstalled pip-25.1
2025-05-07T20:24:50.4377630Z Successfully installed pip-25.1.1
2025-05-07T20:24:50.4991832Z [SETUP] Upgrading pyOpenSSL ...
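Note the version spec in the next command: pyOpenSSL>22.1.0 contains a '>' that an interactive shell would parse as an output redirect, so it must be quoted when run by hand (inside the CI script the argument reaches conda as a single word; the quoted standalone form below is an assumption):

    # Quote the spec so '>' reaches conda instead of the shell.
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"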
2025-05-07T20:24:50.5014798Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:51.3781873Z Channels:
2025-05-07T20:24:51.3782185Z - conda-forge
2025-05-07T20:24:51.3782416Z Platform: linux-64
2025-05-07T20:25:01.7921259Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.4829960Z Solving environment: done
2025-05-07T20:25:03.5460539Z ## Package Plan ##
2025-05-07T20:25:03.5461035Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.5461609Z added / updated specs:
2025-05-07T20:25:03.5461884Z   - pyopenssl[version='>22.1.0']
2025-05-07T20:25:03.5462282Z The following packages will be downloaded:
2025-05-07T20:25:03.5462795Z package | build
2025-05-07T20:25:03.5463269Z ---------------------------|-----------------
2025-05-07T20:25:03.5463811Z cffi-1.17.1 | py312h06ac9bb_0 288 KB conda-forge
2025-05-07T20:25:03.5464341Z cryptography-44.0.3 | py312hda17c39_0 1.5 MB conda-forge
2025-05-07T20:25:03.5464967Z expat-2.7.0 | h5888daf_0 137 KB conda-forge
2025-05-07T20:25:03.5465779Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge
2025-05-07T20:25:03.5466211Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:25:03.5466622Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:25:03.5467043Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:25:03.5467452Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge
2025-05-07T20:25:03.5467882Z libsqlite-3.46.0 | hde9e2c9_0 845 KB conda-forge
2025-05-07T20:25:03.5468301Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge
2025-05-07T20:25:03.5468723Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:25:03.5469146Z libzlib-1.2.13 | h4ab18f5_6 60 KB conda-forge
2025-05-07T20:25:03.5469554Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:25:03.5470247Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:25:03.5470687Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:25:03.5471132Z python-3.12.2 |hab00c5b_0_cpython 30.8 MB conda-forge
2025-05-07T20:25:03.5471555Z python_abi-3.12 | 7_cp312 7 KB conda-forge
2025-05-07T20:25:03.5472017Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:25:03.5472827Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:25:03.5473274Z zlib-1.2.13 | h4ab18f5_6 91 KB conda-forge
2025-05-07T20:25:03.5473656Z ------------------------------------------------------------
2025-05-07T20:25:03.5474003Z Total: 38.6 MB
2025-05-07T20:25:03.5474360Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.5474775Z cffi conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:25:03.5475275Z cryptography conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:25:03.5475867Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:25:03.5476311Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:03.5476732Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:25:03.5478874Z libsqlite conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:25:03.5479351Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:03.5479802Z libzlib conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.5480265Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:03.5480819Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:03.5481412Z python_abi conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:25:03.5481925Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:03.5482508Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:03.5482970Z The following packages will be UPDATED:
2025-05-07T20:25:03.5483575Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:03.5484334Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:03.5484975Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:03.5485599Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:03.5486281Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:03.5486941Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.5487485Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:03.5488043Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:25:03.5488668Z python pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:25:03.5489205Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:03.6396870Z libsqlite-3.46.0 | 845 KB | ########## | 100%
2025-05-07T20:25:03.6492491Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:25:03.6683228Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:25:03.6998554Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:25:03.7419589Z cffi-1.17.1 | 288 KB | ########## | 100%
2025-05-07T20:25:03.7602113Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:25:03.7629892Z expat-2.7.0 | 137 KB | ########## | 100%
2025-05-07T20:25:03.7684695Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:25:03.8088886Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:25:03.8129037Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:25:03.8218506Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:25:03.8367486Z zlib-1.2.13 | 91 KB | ########## | 100%
2025-05-07T20:25:03.8684959Z libexpat-2.7.0 | 73 KB | ########## | 100%
2025-05-07T20:25:03.8841048Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:25:03.8982199Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:03.9074397Z libzlib-1.2.13 | 60 KB | ########## | 100%
2025-05-07T20:25:03.9147383Z libuuid-2.38.1 | 33 KB | ########## | 100%
2025-05-07T20:25:03.9351297Z libnsl-2.0.1 | 33 KB | ########## | 100%
2025-05-07T20:25:03.9686350Z ... (more hidden) ...
2025-05-07T20:25:04.2808695Z python-3.12.2 | 30.8 MB | ########1 | 81%
2025-05-07T20:25:04.3579115Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.3579497Z 2025-05-07T20:25:04.3579501Z 2025-05-07T20:25:04.3579505Z 2025-05-07T20:25:04.3579508Z 2025-05-07T20:25:04.3579512Z 2025-05-07T20:25:04.3579515Z 2025-05-07T20:25:04.3579519Z 2025-05-07T20:25:04.3579522Z 2025-05-07T20:25:04.3579535Z 2025-05-07T20:25:04.3579538Z 2025-05-07T20:25:04.3579542Z 2025-05-07T20:25:04.3579545Z 2025-05-07T20:25:04.3579549Z 2025-05-07T20:25:04.3579552Z 2025-05-07T20:25:04.3579556Z 2025-05-07T20:25:04.3579559Z 2025-05-07T20:25:04.3580022Z 2025-05-07T20:25:04.3692742Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.3863599Z python-3.12.2 | 30.8 MB | #########6 | 96% 2025-05-07T20:25:04.3863950Z 2025-05-07T20:25:04.3863954Z 2025-05-07T20:25:04.3863958Z 2025-05-07T20:25:04.3863971Z 2025-05-07T20:25:04.3863975Z 2025-05-07T20:25:04.3863979Z 2025-05-07T20:25:04.3863982Z 2025-05-07T20:25:04.3863986Z 2025-05-07T20:25:04.3863990Z 2025-05-07T20:25:04.3863993Z 2025-05-07T20:25:04.3863997Z 2025-05-07T20:25:04.3864000Z 2025-05-07T20:25:04.3864004Z 2025-05-07T20:25:04.3864015Z 2025-05-07T20:25:04.3864019Z 2025-05-07T20:25:04.3864022Z 2025-05-07T20:25:04.3864026Z 2025-05-07T20:25:04.3864029Z 2025-05-07T20:25:04.3864245Z 2025-05-07T20:25:04.3913098Z ... (more hidden) ... 2025-05-07T20:25:04.3913527Z 2025-05-07T20:25:04.3913533Z 2025-05-07T20:25:04.3913538Z 2025-05-07T20:25:04.3913543Z 2025-05-07T20:25:04.3913548Z 2025-05-07T20:25:04.3913553Z 2025-05-07T20:25:04.3913558Z 2025-05-07T20:25:04.3913563Z 2025-05-07T20:25:04.3913568Z 2025-05-07T20:25:04.3913574Z 2025-05-07T20:25:04.3913579Z 2025-05-07T20:25:04.3913584Z 2025-05-07T20:25:04.3913589Z 2025-05-07T20:25:04.3913594Z 2025-05-07T20:25:04.3913599Z 2025-05-07T20:25:04.3913875Z 2025-05-07T20:25:04.3913883Z 2025-05-07T20:25:04.3913889Z 2025-05-07T20:25:04.3918401Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.3918866Z 2025-05-07T20:25:04.3918873Z 2025-05-07T20:25:04.3918879Z 2025-05-07T20:25:04.3918885Z 2025-05-07T20:25:04.3918891Z 2025-05-07T20:25:04.3918896Z 2025-05-07T20:25:04.3918911Z 2025-05-07T20:25:04.3918931Z 2025-05-07T20:25:04.3918936Z 2025-05-07T20:25:04.3918942Z 2025-05-07T20:25:04.3918947Z 2025-05-07T20:25:04.3918953Z 2025-05-07T20:25:04.3918957Z 2025-05-07T20:25:04.3918963Z 2025-05-07T20:25:04.3918968Z 2025-05-07T20:25:04.3918973Z 2025-05-07T20:25:04.3918979Z 2025-05-07T20:25:04.3918984Z 2025-05-07T20:25:04.4474347Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.4474667Z 2025-05-07T20:25:04.4474671Z 2025-05-07T20:25:04.4487424Z cryptography-44.0.3 | 1.5 MB | ########## | 100%  2025-05-07T20:25:05.1391500Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.1398065Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.1398301Z 2025-05-07T20:25:05.1398399Z 2025-05-07T20:25:05.1398404Z 2025-05-07T20:25:05.1398426Z 2025-05-07T20:25:05.1398431Z 2025-05-07T20:25:05.1398434Z 2025-05-07T20:25:05.1398439Z 2025-05-07T20:25:05.1398442Z 2025-05-07T20:25:05.1398464Z 2025-05-07T20:25:05.1398470Z 2025-05-07T20:25:05.1398474Z 2025-05-07T20:25:05.1398478Z 2025-05-07T20:25:05.1398484Z 2025-05-07T20:25:05.1398489Z 2025-05-07T20:25:05.1398495Z 2025-05-07T20:25:05.1398500Z 2025-05-07T20:25:05.1398506Z 2025-05-07T20:25:05.1398510Z 2025-05-07T20:25:05.1398519Z 2025-05-07T20:25:05.1398643Z 2025-05-07T20:25:05.1399054Z  2025-05-07T20:25:05.1399381Z 2025-05-07T20:25:05.1399579Z 2025-05-07T20:25:05.1399758Z  2025-05-07T20:25:05.1399961Z 2025-05-07T20:25:05.1399965Z 
2025-05-07T20:25:05.1400130Z  2025-05-07T20:25:05.1400355Z 2025-05-07T20:25:05.1400361Z 2025-05-07T20:25:05.1400366Z 2025-05-07T20:25:05.1400625Z  2025-05-07T20:25:05.1400966Z 2025-05-07T20:25:05.1400971Z 2025-05-07T20:25:05.1400976Z 2025-05-07T20:25:05.1400981Z 2025-05-07T20:25:05.1401219Z  2025-05-07T20:25:05.1401436Z 2025-05-07T20:25:05.1401440Z 2025-05-07T20:25:05.1401443Z 2025-05-07T20:25:05.1401446Z 2025-05-07T20:25:05.1401450Z 2025-05-07T20:25:05.1401624Z  2025-05-07T20:25:05.1401839Z 2025-05-07T20:25:05.1401843Z 2025-05-07T20:25:05.1401846Z 2025-05-07T20:25:05.1401850Z 2025-05-07T20:25:05.1401853Z 2025-05-07T20:25:05.1401863Z 2025-05-07T20:25:05.1402040Z  2025-05-07T20:25:05.1402263Z 2025-05-07T20:25:05.1402266Z 2025-05-07T20:25:05.1402270Z 2025-05-07T20:25:05.1402273Z 2025-05-07T20:25:05.1402277Z 2025-05-07T20:25:05.1402280Z 2025-05-07T20:25:05.1402283Z 2025-05-07T20:25:05.1402463Z  2025-05-07T20:25:05.1403003Z 2025-05-07T20:25:05.1403008Z 2025-05-07T20:25:05.1403014Z 2025-05-07T20:25:05.1403019Z 2025-05-07T20:25:05.1403025Z 2025-05-07T20:25:05.1403030Z 2025-05-07T20:25:05.1403036Z 2025-05-07T20:25:05.1403041Z 2025-05-07T20:25:05.1403294Z  2025-05-07T20:25:05.1403597Z 2025-05-07T20:25:05.1403601Z 2025-05-07T20:25:05.1403605Z 2025-05-07T20:25:05.1403608Z 2025-05-07T20:25:05.1403612Z 2025-05-07T20:25:05.1403615Z 2025-05-07T20:25:05.1403619Z 2025-05-07T20:25:05.1403786Z 2025-05-07T20:25:05.1403791Z 2025-05-07T20:25:05.1403994Z  2025-05-07T20:25:05.1404211Z 2025-05-07T20:25:05.1404215Z 2025-05-07T20:25:05.1404218Z 2025-05-07T20:25:05.1404222Z 2025-05-07T20:25:05.1404225Z 2025-05-07T20:25:05.1404228Z 2025-05-07T20:25:05.1404232Z 2025-05-07T20:25:05.1404235Z 2025-05-07T20:25:05.1404239Z 2025-05-07T20:25:05.1404250Z 2025-05-07T20:25:05.1404446Z  2025-05-07T20:25:05.1404665Z 2025-05-07T20:25:05.1404669Z 2025-05-07T20:25:05.1404673Z 2025-05-07T20:25:05.1404676Z 2025-05-07T20:25:05.1404680Z 2025-05-07T20:25:05.1404683Z 2025-05-07T20:25:05.1404687Z 2025-05-07T20:25:05.1404690Z 2025-05-07T20:25:05.1404701Z 2025-05-07T20:25:05.1404705Z 2025-05-07T20:25:05.1404708Z 2025-05-07T20:25:05.1404899Z  2025-05-07T20:25:05.1405126Z 2025-05-07T20:25:05.1405129Z 2025-05-07T20:25:05.1405133Z 2025-05-07T20:25:05.1405142Z 2025-05-07T20:25:05.1405146Z 2025-05-07T20:25:05.1405149Z 2025-05-07T20:25:05.1405153Z 2025-05-07T20:25:05.1405156Z 2025-05-07T20:25:05.1405160Z 2025-05-07T20:25:05.1405163Z 2025-05-07T20:25:05.1405167Z 2025-05-07T20:25:05.1405170Z 2025-05-07T20:25:05.1405362Z  2025-05-07T20:25:05.1405595Z 2025-05-07T20:25:05.1405599Z 2025-05-07T20:25:05.1405602Z 2025-05-07T20:25:05.1405606Z 2025-05-07T20:25:05.1405609Z 2025-05-07T20:25:05.1405613Z 2025-05-07T20:25:05.1405616Z 2025-05-07T20:25:05.1405620Z 2025-05-07T20:25:05.1405623Z 2025-05-07T20:25:05.1405627Z 2025-05-07T20:25:05.1405630Z 2025-05-07T20:25:05.1405634Z 2025-05-07T20:25:05.1405637Z 2025-05-07T20:25:05.1405833Z  2025-05-07T20:25:05.1406071Z 2025-05-07T20:25:05.1406081Z 2025-05-07T20:25:05.1406086Z 2025-05-07T20:25:05.1406090Z 2025-05-07T20:25:05.1406094Z 2025-05-07T20:25:05.1406099Z 2025-05-07T20:25:05.1406103Z 2025-05-07T20:25:05.1406108Z 2025-05-07T20:25:05.1406112Z 2025-05-07T20:25:05.1406117Z 2025-05-07T20:25:05.1406121Z 2025-05-07T20:25:05.1406125Z 2025-05-07T20:25:05.1406130Z 2025-05-07T20:25:05.1406134Z 2025-05-07T20:25:05.1406373Z  2025-05-07T20:25:05.1406605Z 2025-05-07T20:25:05.1406609Z 2025-05-07T20:25:05.1406612Z 2025-05-07T20:25:05.1406616Z 2025-05-07T20:25:05.1406619Z 2025-05-07T20:25:05.1406623Z 2025-05-07T20:25:05.1406627Z 
2025-05-07T20:25:05.1406635Z 2025-05-07T20:25:05.1406639Z 2025-05-07T20:25:05.1406642Z 2025-05-07T20:25:05.1406646Z 2025-05-07T20:25:05.1406649Z 2025-05-07T20:25:05.1406653Z 2025-05-07T20:25:05.1406656Z 2025-05-07T20:25:05.1406660Z 2025-05-07T20:25:05.1406874Z  2025-05-07T20:25:05.1407108Z 2025-05-07T20:25:05.1407111Z 2025-05-07T20:25:05.1407115Z 2025-05-07T20:25:05.1407118Z 2025-05-07T20:25:05.1407122Z 2025-05-07T20:25:05.1407125Z 2025-05-07T20:25:05.1407129Z 2025-05-07T20:25:05.1407132Z 2025-05-07T20:25:05.1407136Z 2025-05-07T20:25:05.1407139Z 2025-05-07T20:25:05.1407143Z 2025-05-07T20:25:05.1407146Z 2025-05-07T20:25:05.1407150Z 2025-05-07T20:25:05.1407234Z 2025-05-07T20:25:05.1407237Z 2025-05-07T20:25:05.1407241Z 2025-05-07T20:25:05.1407457Z  2025-05-07T20:25:05.1407690Z 2025-05-07T20:25:05.1407694Z 2025-05-07T20:25:05.1407697Z 2025-05-07T20:25:05.1407701Z 2025-05-07T20:25:05.1407705Z 2025-05-07T20:25:05.1407708Z 2025-05-07T20:25:05.1407712Z 2025-05-07T20:25:05.1407715Z 2025-05-07T20:25:05.1407719Z 2025-05-07T20:25:05.1407722Z 2025-05-07T20:25:05.1407726Z 2025-05-07T20:25:05.1407729Z 2025-05-07T20:25:05.1407817Z 2025-05-07T20:25:05.1407828Z 2025-05-07T20:25:05.1407832Z 2025-05-07T20:25:05.1407835Z 2025-05-07T20:25:05.1407839Z 2025-05-07T20:25:05.1408062Z  2025-05-07T20:25:05.1408298Z 2025-05-07T20:25:05.1408302Z 2025-05-07T20:25:05.1408305Z 2025-05-07T20:25:05.1408309Z 2025-05-07T20:25:05.1408312Z 2025-05-07T20:25:05.1408322Z 2025-05-07T20:25:05.1408325Z 2025-05-07T20:25:05.1408329Z 2025-05-07T20:25:05.1408332Z 2025-05-07T20:25:05.1408336Z 2025-05-07T20:25:05.1408340Z 2025-05-07T20:25:05.1408343Z 2025-05-07T20:25:05.1408347Z 2025-05-07T20:25:05.1408350Z 2025-05-07T20:25:05.1408354Z 2025-05-07T20:25:05.1408357Z 2025-05-07T20:25:05.1408361Z 2025-05-07T20:25:05.1408364Z 2025-05-07T20:25:05.1408588Z  2025-05-07T20:25:05.1408825Z 2025-05-07T20:25:05.1408905Z done 2025-05-07T20:25:05.2418642Z Preparing transaction: / done 2025-05-07T20:25:05.9788391Z Verifying transaction: \ | / - \ | / done 2025-05-07T20:25:07.5838338Z Executing transaction: \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:25:07.9363274Z [SETUP] Testing pyOpenSSL import ... 2025-05-07T20:25:09.6794300Z [CHECK] Python (sub-)package 'OpenSSL' found ... 2025-05-07T20:25:09.6807420Z [SETUP] Installing libxcrypt ... 2025-05-07T20:25:09.6831416Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt 2025-05-07T20:25:10.5463743Z Channels: 2025-05-07T20:25:10.5464053Z - conda-forge 2025-05-07T20:25:10.5464283Z Platform: linux-64 2025-05-07T20:25:13.7959981Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:25:14.1613126Z Solving environment: \ done 2025-05-07T20:25:14.1983222Z 2025-05-07T20:25:14.1983907Z # All requested packages already installed. 2025-05-07T20:25:14.1984179Z 2025-05-07T20:25:17.5667010Z [SETUP] Copying over ... 2025-05-07T20:25:17.5667724Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h 2025-05-07T20:25:17.5668272Z 2025-05-07T20:25:17.5700309Z 2025-05-07T20:25:19.2025148Z [SETUP] Installed Python version: Python 3.12.2 2025-05-07T20:25:19.2025583Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:25:19.2062617Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:19.2063098Z . 
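The crypt.h copy above publishes conda-forge's libxcrypt header inside the Python include tree, so translation units that expect to find crypt.h next to the Python 3.12 headers keep compiling. A sketch of the same fix-up with the env prefix resolved at run time rather than hard-coded (a hypothetical variant; the log itself uses the absolute path):

    # Resolve the env prefix via the env's own interpreter, then mirror
    # the cp step from the log (include/python3.12 assumed per this log).
    PREFIX="$(conda run -n build_binary python -c 'import sys; print(sys.prefix)')"
    cp "${PREFIX}/include/crypt.h" "${PREFIX}/include/python3.12/crypt.h"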
2025-05-07T20:25:19.2062617Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.2063098Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.2077694Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:19.2078045Z env:
2025-05-07T20:25:19.2078273Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:19.2078570Z   BUILD_ENV: build_binary
2025-05-07T20:25:19.2078865Z   BUILD_TARGET: genai
2025-05-07T20:25:19.2079100Z   BUILD_VARIANT: cuda
2025-05-07T20:25:19.2079331Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:19.2079592Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:19.2079891Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:19.2080224Z ##[endgroup]
2025-05-07T20:25:19.5432000Z ################################################################################
2025-05-07T20:25:19.5432341Z # Install C/C++ Compilers
2025-05-07T20:25:19.5432579Z #
2025-05-07T20:25:19.5447680Z # [2025-05-07T20:25:19.544Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:19.5448468Z ################################################################################
2025-05-07T20:25:19.5462906Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:19.6360167Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:19.6371101Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:19.6393477Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:20.5019242Z Channels:
2025-05-07T20:25:20.5019556Z  - conda-forge
2025-05-07T20:25:20.5019897Z Platform: linux-64
2025-05-07T20:25:23.9216738Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:24.2932216Z Solving environment: done
2025-05-07T20:25:24.3566047Z ## Package Plan ##
2025-05-07T20:25:24.3566591Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:24.3567133Z   added / updated specs:
2025-05-07T20:25:24.3567499Z     - sysroot_linux-64=2.17
2025-05-07T20:25:24.3567944Z The following packages will be downloaded:
2025-05-07T20:25:24.3568456Z     package                          |            build
2025-05-07T20:25:24.3568956Z     ---------------------------------|-----------------
2025-05-07T20:25:24.3569618Z     kernel-headers_linux-64-3.10.0   |      he073ed8_18       921 KB  conda-forge
2025-05-07T20:25:24.3570352Z     sysroot_linux-64-2.17            |      h0157908_18      14.5 MB  conda-forge
2025-05-07T20:25:24.3570947Z     ------------------------------------------------------------
2025-05-07T20:25:24.3571377Z                                                   Total:      15.4 MB
2025-05-07T20:25:24.3571776Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:24.3572384Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:24.3572952Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
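The [EXEC] [ATTEMPT 0/3] prefixes indicate that each setup command runs under a retry wrapper defined in .github/scripts/setup_env.bash; the wrapper's body never appears in this log, so the following is a hypothetical bash sketch of the pattern the prefixes imply (function name, backoff, and attempt accounting are all assumptions):

    # Hypothetical retry wrapper matching the "[EXEC] [ATTEMPT n/3]" log lines.
    exec_with_retries () {
      local max=3 attempt=0
      while true; do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0            # command succeeded; stop retrying
        attempt=$((attempt + 1))
        [ "${attempt}" -gt "${max}" ] && return 1
        sleep 2                     # assumed brief backoff between attempts
      done
    }
    # Usage, mirroring the log:
    # exec_with_retries conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17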
2025-05-07T20:25:24.3573413Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:24.3573789Z [download progress condensed: kernel-headers_linux-64 (921 KB) and sysroot_linux-64 (14.5 MB) both reached 100%]
2025-05-07T20:25:25.3361616Z done
2025-05-07T20:25:25.4366050Z Preparing transaction: done
2025-05-07T20:25:25.6380932Z Verifying transaction: done
2025-05-07T20:25:25.8443564Z Executing transaction: done
2025-05-07T20:25:25.9984155Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:25.9984478Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:27.6702030Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:27.6715264Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:27.6736863Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:28.5610504Z Channels:
2025-05-07T20:25:28.5610828Z  - conda-forge
2025-05-07T20:25:28.5611059Z Platform: linux-64
2025-05-07T20:25:31.8521300Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:32.8090004Z Solving environment: done
2025-05-07T20:25:32.8744438Z ## Package Plan ##
2025-05-07T20:25:32.8744981Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:32.8745526Z   added / updated specs:
2025-05-07T20:25:32.8745838Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:32.8746133Z The following packages will be downloaded:
2025-05-07T20:25:32.8746511Z     package                          |            build
2025-05-07T20:25:32.8746829Z     ---------------------------------|-----------------
2025-05-07T20:25:32.8747235Z     binutils_impl_linux-64-2.40      |       ha1999f0_7       6.0 MB  conda-forge
2025-05-07T20:25:32.8747712Z     binutils_linux-64-2.40           |       hb3c18ed_4        28 KB  conda-forge
2025-05-07T20:25:32.8748167Z     gcc_impl_linux-64-11.4.0         |      h00c12a0_13      53.0 MB  conda-forge
2025-05-07T20:25:32.8748600Z     gcc_linux-64-11.4.0              |       ha077dfb_4        31 KB  conda-forge
2025-05-07T20:25:32.8749194Z     gxx_impl_linux-64-11.4.0         |      h634f3ee_13      11.2 MB  conda-forge
2025-05-07T20:25:32.8749856Z     gxx_linux-64-11.4.0              |       h35bfe5d_4        29 KB  conda-forge
2025-05-07T20:25:32.8750505Z     ld_impl_linux-64-2.40            |       hf3520f5_7       691 KB  conda-forge
2025-05-07T20:25:32.8751276Z     libgcc-devel_linux-64-11.4.0     |     h8f596e0_113       2.3 MB  conda-forge
2025-05-07T20:25:32.8752016Z     libsanitizer-11.4.0              |      h5763a12_13       3.5 MB  conda-forge
2025-05-07T20:25:32.8752693Z     libstdcxx-15.1.0                 |       h8f9b012_2       3.7 MB  conda-forge
2025-05-07T20:25:32.8753363Z     libstdcxx-devel_linux-64-11.4.0  |     h8f596e0_113      11.1 MB  conda-forge
2025-05-07T20:25:32.8753931Z     libstdcxx-ng-15.1.0              |       h4852527_2        34 KB  conda-forge
2025-05-07T20:25:32.8754408Z     ------------------------------------------------------------
2025-05-07T20:25:32.8754840Z                                                   Total:      91.6 MB
2025-05-07T20:25:32.8755245Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:32.8755861Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:32.8756415Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:32.8757285Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:32.8757798Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:32.8758383Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:32.8758885Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:32.8759406Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8759990Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:32.8760718Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:32.8761594Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8762258Z The following packages will be UPDATED:
2025-05-07T20:25:32.8762920Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:32.8763849Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:32.8764521Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:32.8764894Z [download progress condensed: gcc_impl_linux-64 (53.0 MB), gxx_impl_linux-64 (11.2 MB), libstdcxx-devel_linux-64 (11.1 MB), binutils_impl_linux-64 (6.0 MB), libstdcxx (3.7 MB), libsanitizer (3.5 MB), libgcc-devel_linux-64 (2.3 MB), ld_impl_linux-64 (691 KB), libstdcxx-ng, gcc_linux-64, gxx_linux-64, and binutils_linux-64 all reached 100%]
2025-05-07T20:25:35.0446908Z done
2025-05-07T20:25:35.1440909Z Preparing transaction: done
2025-05-07T20:25:35.4452756Z Verifying transaction: done
2025-05-07T20:25:35.5461572Z Executing transaction: done
2025-05-07T20:25:35.7114447Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:39.6053941Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:39.6086352Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:39.6115259Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:39.6145756Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:41.5031424Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:41.5644856Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:43.4477391Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:43.5088988Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:45.3903067Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:45.4516361Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:47.3330310Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:47.3967322Z [CHECK] Binary g++ found in PATH
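The four ln -sf calls above map the generic cc/gcc/c++/g++ names onto conda's prefixed cross-toolchain binaries, and each name is then resolved through PATH as a sanity check. A compact sketch of the same setup (the env prefix is the one shown in this log; the loops are merely a condensed form of the four explicit calls):

    # Point the generic compiler names at the conda toolchain binaries.
    BIN=/home/ec2-user/miniconda/envs/build_binary/bin
    for name in cc gcc; do
      ln -sf "${BIN}/x86_64-conda-linux-gnu-cc" "${BIN}/${name}"
    done
    for name in c++ g++; do
      ln -sf "${BIN}/x86_64-conda-linux-gnu-c++" "${BIN}/${name}"
    done
    # Sanity check: each name should resolve inside ${BIN}.
    which cc gcc c++ g++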
2025-05-07T20:25:47.3971350Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:47.3971955Z + conda run -n build_binary cc -dM -E -
2025-05-07T20:25:49.2890375Z [macro dump condensed: roughly two hundred predefined preprocessor macros follow; representative entries below]
2025-05-07T20:25:49.2928020Z #define __STDC_IEC_559__ 1
2025-05-07T20:25:49.2930832Z #define __gnu_linux__ 1
2025-05-07T20:25:49.2933222Z #define __GNUC__ 11
2025-05-07T20:25:49.2948822Z #define __GXX_ABI_VERSION 1016
2025-05-07T20:25:49.2954496Z #define __LP64__ 1
2025-05-07T20:25:49.2963296Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:49.2980220Z #define __x86_64__ 1
2025-05-07T20:25:49.2993180Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:49.3006605Z #define __linux__ 1
2025-05-07T20:25:49.3028272Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
2025-05-07T20:25:49.3030555Z [... remainder of the macro listing omitted from this excerpt; the log itself breaks off mid-entry at "#define __LONG_WIDTH__"]
64 2025-05-07T20:25:49.3030879Z #define __PIC__ 2 2025-05-07T20:25:49.3031230Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:49.3031807Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:49.3032364Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:49.3032838Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:49.3033335Z #define __SSE2__ 1 2025-05-07T20:25:49.3033662Z #define __INT32_TYPE__ int 2025-05-07T20:25:49.3033998Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:49.3034364Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:49.3034839Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:49.3035336Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:49.3035850Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:49.3036234Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:49.3036626Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.3048011Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:49.3048434Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:49.3048781Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:49.3049212Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.3049652Z #define __PIE__ 2 2025-05-07T20:25:49.3050108Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:49.3050691Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:49.3051197Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:49.3051720Z #define __INT16_C(c) c 2025-05-07T20:25:49.3052031Z #define __STDC__ 1 2025-05-07T20:25:49.3052360Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:49.3052743Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:49.3053102Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.3053532Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:49.3054032Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:49.3054528Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:49.3054903Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.3055305Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:49.3055689Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:49.3056095Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:49.3056513Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.3056918Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:49.3057327Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.3057898Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:49.3058437Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:49.3058860Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:49.3059286Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:49.3059643Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:49.3059865Z 2025-05-07T20:25:49.3563859Z 2025-05-07T20:25:49.3564622Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:49.3565125Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:49.3566257Z 2025-05-07T20:25:51.2440452Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:51.2440810Z #define __cpp_attributes 200809L 2025-05-07T20:25:51.2441161Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:51.2441519Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:51.2441809Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:51.2442071Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:51.2442415Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:51.2442775Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:51.2443056Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:51.2443379Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:51.2443695Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:51.2443961Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:51.2444215Z #define __CHAR_BIT__ 8 2025-05-07T20:25:51.2444457Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:51.2445056Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:51.2445319Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:51.2445592Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:51.2445872Z #define __cpp_static_assert 201411L 2025-05-07T20:25:51.2446160Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:51.2446463Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2446771Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:51.2447058Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:51.2447389Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:51.2447721Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:51.2448120Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:51.2448538Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:51.2448856Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:51.2449138Z #define __GCC_IEC_559 2 2025-05-07T20:25:51.2449395Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:51.2449846Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:51.2450149Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:51.2450468Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:51.2450769Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:51.2451095Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:51.2451404Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:51.2451738Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2452073Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:51.2452347Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.2452632Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:51.2452920Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:51.2453217Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:51.2453492Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:51.2453829Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:51.2454114Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:51.2454450Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:51.2454789Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:51.2455054Z #define __INT8_C(c) c 2025-05-07T20:25:51.2455291Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:51.2455567Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:51.2455895Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2456219Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:51.2456503Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:51.2456802Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:51.2457115Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.2457473Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:51.2457765Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:51.2458046Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.2458317Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2458602Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:51.2458883Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:51.2459286Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:51.2459701Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:51.2459996Z #define __linux 1 2025-05-07T20:25:51.2460248Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:51.2460555Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:51.2460838Z #define __unix 1 2025-05-07T20:25:51.2461062Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:51.2461350Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:51.2461645Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:51.2461916Z #define __WINT_MIN__ 0U 2025-05-07T20:25:51.2462172Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.2462462Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:51.2462749Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:51.2463020Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:51.2463279Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:51.2463570Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:51.2463971Z #define __INT64_C(c) c ## L 2025-05-07T20:25:51.2464247Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:51.2464553Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:51.2464827Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:51.2465137Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:51.2465783Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:51.2466087Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:51.2466446Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:51.2466830Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:51.2467084Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:51.2467371Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:51.2467657Z #define __DBL_DIG__ 15 2025-05-07T20:25:51.2467894Z #define __FLT32_DIG__ 6 2025-05-07T20:25:51.2468194Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:51.2468550Z #define __GXX_WEAK__ 1 2025-05-07T20:25:51.2468795Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:51.2469219Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:51.2469583Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:51.2469978Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.2470262Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:51.2470643Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:51.2471011Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:51.2471469Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:51.2471928Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:51.2472231Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:51.2472516Z #define __unix__ 1 2025-05-07T20:25:51.2472749Z #define __INT_WIDTH__ 32 2025-05-07T20:25:51.2473018Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:51.2473287Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:51.2473574Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:51.2473862Z #define __UINT16_C(c) c 2025-05-07T20:25:51.2474134Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:51.2474415Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:51.2474811Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:51.2475227Z #define __gnu_linux__ 1 2025-05-07T20:25:51.2475487Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:51.2475844Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:51.2476135Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.2476427Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2476704Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:51.2476965Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:51.2477225Z #define __GNUC__ 11 2025-05-07T20:25:51.2477448Z #define __GXX_RTTI 1 2025-05-07T20:25:51.2477674Z #define __pie__ 2 2025-05-07T20:25:51.2477894Z #define __MMX__ 1 2025-05-07T20:25:51.2478121Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:51.2478383Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:51.2478668Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:51.2478946Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:51.2479194Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:51.2479495Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:51.2479820Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:51.2480184Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.2480591Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:51.2480899Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2481218Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.2481479Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:51.2481747Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:51.2482058Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:51.2482368Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:51.2482630Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:51.2482895Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:51.2483189Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:51.2483485Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:51.2483937Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:51.2484230Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:51.2484483Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:51.2484755Z #define __cplusplus 201703L 2025-05-07T20:25:51.2485027Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:51.2485312Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:51.2485573Z #define __DEPRECATED 1 2025-05-07T20:25:51.2485830Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:51.2486131Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:51.2486385Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:51.2486709Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:51.2487071Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:51.2487338Z #define __SSE2_MATH__ 1 2025-05-07T20:25:51.2487591Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:51.2487897Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2488186Z #define __amd64 1 2025-05-07T20:25:51.2488518Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:51.2488789Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:51.2489054Z #define __GNUG__ 11 2025-05-07T20:25:51.2489314Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:51.2489629Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:51.2489882Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:51.2490177Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:51.2490474Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:51.2490733Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:51.2491004Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:51.2491302Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:51.2491571Z #define __cpp_hex_float 201603L 2025-05-07T20:25:51.2491837Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:51.2492106Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:51.2492383Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:51.2492645Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:51.2492917Z #define __x86_64 1 2025-05-07T20:25:51.2493162Z #define __cpp_lambdas 200907L 2025-05-07T20:25:51.2493429Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:51.2493801Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:51.2494194Z #define __cpp_template_auto 201606L 2025-05-07T20:25:51.2494546Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:51.2495002Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:51.2495471Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:51.2495861Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:51.2496112Z #define __LP64__ 1 2025-05-07T20:25:51.2496346Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2496702Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:51.2497078Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:51.2497360Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.2497650Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:51.2497930Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:51.2498207Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:51.2498472Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:51.2498734Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.2499066Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:51.2499431Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:51.2499718Z #define __FLT_DIG__ 6 2025-05-07T20:25:51.2499948Z #define __NO_INLINE__ 1 2025-05-07T20:25:51.2500195Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:51.2500524Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:51.2500869Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:51.2501126Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:51.2501391Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:51.2501648Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:51.2501927Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:51.2502381Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:51.2502647Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:51.2502940Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:51.2503226Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:51.2503497Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:51.2503793Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.2504133Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:51.2504426Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:51.2504688Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:51.2504947Z #define __FLT128_DIG__ 33 2025-05-07T20:25:51.2505189Z #define __INT32_C(c) c 2025-05-07T20:25:51.2505426Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:51.2505710Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:51.2505989Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:51.2506266Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:51.2506582Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:51.2506890Z #define unix 1 2025-05-07T20:25:51.2507243Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:51.2507506Z #define __cpp_rtti 199711L 2025-05-07T20:25:51.2507774Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:51.2508085Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2508391Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:51.2508705Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:51.2509036Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:51.2509288Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:51.2509587Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:51.2509869Z #define __ELF__ 1 2025-05-07T20:25:51.2510121Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:51.2510436Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:51.2510719Z #define __FLT_RADIX__ 2 2025-05-07T20:25:51.2510967Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:51.2511327Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:51.2511699Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:51.2511974Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:51.2512254Z #define __k8 1 2025-05-07T20:25:51.2512552Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:51.2512928Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:51.2513220Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:51.2513522Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:51.2513786Z #define __LDBL_DIG__ 18 2025-05-07T20:25:51.2514028Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:51.2514292Z #define __x86_64__ 1 2025-05-07T20:25:51.2514538Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:51.2514835Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:51.2515174Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2515487Z #define __FLT64_DIG__ 15 2025-05-07T20:25:51.2515917Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2516267Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:51.2516599Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2516864Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:51.2517143Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2517444Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:51.2517814Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:51.2518205Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:51.2518501Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:51.2518829Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:51.2519141Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:51.2519466Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:51.2519768Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:51.2520044Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:51.2520361Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:51.2520683Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:51.2520920Z #define __SEG_FS 1 2025-05-07T20:25:51.2521259Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:51.2521541Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:51.2521820Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2522103Z #define __SEG_GS 1 2025-05-07T20:25:51.2522417Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:51.2522806Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:51.2523076Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:51.2523364Z #define __INT16_TYPE__ short int 2025-05-07T20:25:51.2523646Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:51.2523954Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:51.2524257Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:51.2524511Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:51.2524769Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:51.2525115Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:51.2525508Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2525926Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:51.2526248Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:51.2526550Z #define linux 1 2025-05-07T20:25:51.2526786Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2527065Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:51.2527340Z #define __EXCEPTIONS 1 2025-05-07T20:25:51.2527589Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:51.2527849Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:51.2528120Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:51.2528416Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:51.2528758Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:51.2529149Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:51.2529500Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:51.2529831Z #define __code_model_small__ 1 2025-05-07T20:25:51.2530102Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:51.2530420Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:51.2530735Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:51.2531008Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:51.2531301Z #define __k8__ 1 2025-05-07T20:25:51.2531533Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:51.2531814Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:51.2532112Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:51.2532358Z #define __pic__ 2 2025-05-07T20:25:51.2532602Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2532919Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:51.2533190Z #define __cpp_decltype 200707L 2025-05-07T20:25:51.2533477Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2533808Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:51.2534179Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:51.2534539Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:51.2534831Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:51.2535169Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:51.2535461Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:51.2535711Z #define __linux__ 1 2025-05-07T20:25:51.2535940Z #define __INT64_TYPE__ long int 2025-05-07T20:25:51.2536204Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:51.2536461Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:51.2536736Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:51.2537020Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:51.2537331Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:51.2537627Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2537947Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:51.2538212Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:51.2538509Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:51.2538811Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:51.2539159Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:51.2539512Z #define __SSE__ 1 2025-05-07T20:25:51.2539845Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:51.2540216Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:51.2540581Z #define __amd64__ 1 2025-05-07T20:25:51.2540809Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:51.2541067Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:51.2541336Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:51.2541605Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:51.2541881Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:51.2542139Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:51.2542417Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:51.2551906Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:51.2552293Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:51.2552765Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:51.2553127Z #define _LP64 1 2025-05-07T20:25:51.2553344Z #define __UINT8_C(c) c 2025-05-07T20:25:51.2553606Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:51.2554031Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:51.2554302Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:51.2554571Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:51.2554934Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:51.2555402Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:51.2555942Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2556242Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.2556556Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:51.2556861Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:51.2557249Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:51.2557620Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:51.2557886Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:51.2558154Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:51.2558504Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:51.2558883Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:51.2559141Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:51.2559398Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:51.2559653Z #define __FXSR__ 1 2025-05-07T20:25:51.2559954Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.2560410Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.2560822Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.2561132Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:51.2561404Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:51.2561707Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:51.2561999Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:51.2562271Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:51.2562632Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:51.2562999Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:51.2563275Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:51.2563527Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:51.2563766Z #define __PIC__ 2 2025-05-07T20:25:51.2564013Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:51.2564418Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.2564813Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:51.2565145Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:51.2566490Z #define __cpp_constexpr 201603L 2025-05-07T20:25:51.2566761Z #define __SSE2__ 1 2025-05-07T20:25:51.2566995Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:51.2567290Z #define __INT32_TYPE__ int 2025-05-07T20:25:51.2567546Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:51.2567808Z #define __cpp_exceptions 199711L 2025-05-07T20:25:51.2568090Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:51.2568432Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:51.2569092Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:51.2569365Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:51.2569643Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:51.2569920Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2570220Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:51.2570498Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:51.2570758Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:51.2571051Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:51.2571343Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2571648Z #define __PIE__ 2 2025-05-07T20:25:51.2571967Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:51.2572384Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:51.2572698Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:51.2573047Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:51.2573412Z #define __INT16_C(c) c 2025-05-07T20:25:51.2573645Z #define __STDC__ 1 2025-05-07T20:25:51.2574032Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:51.2574285Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:51.2574564Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:51.2574828Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.2575124Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:51.2575474Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:51.2575813Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:51.2576076Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.2576370Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:51.2576655Z #define __SSE_MATH__ 1 2025-05-07T20:25:51.2576895Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:51.2577182Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:51.2577499Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:51.2577788Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:51.2578076Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.2578355Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:51.2578667Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.2579058Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.2579434Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:51.2579747Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:51.2580036Z #define _GNU_SOURCE 1 2025-05-07T20:25:51.2580290Z #define __cpp_init_captures 201304L 2025-05-07T20:25:51.2580575Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:51.2580827Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:51.2580993Z 2025-05-07T20:25:51.3066755Z 2025-05-07T20:25:51.3067100Z + conda run -n build_binary c++ --version 2025-05-07T20:25:51.3067352Z 2025-05-07T20:25:53.1870728Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:53.1871111Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:53.1871607Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:53.1872166Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:53.1872504Z
2025-05-07T20:25:53.1872509Z
2025-05-07T20:25:53.2488636Z
2025-05-07T20:25:53.2489121Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:53.2489660Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:53.2489968Z
2025-05-07T20:25:55.2110607Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:55.2112521Z
2025-05-07T20:25:55.2113126Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:55.2113699Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:55.2114008Z
2025-05-07T20:25:57.1632179Z #define __cplusplus 201703L
2025-05-07T20:25:57.1634286Z
2025-05-07T20:25:57.1635704Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:57.1670270Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:57.1670701Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:57.1682349Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:57.1682702Z env:
2025-05-07T20:25:57.1682939Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:57.1683255Z BUILD_ENV: build_binary
2025-05-07T20:25:57.1683506Z BUILD_TARGET: genai
2025-05-07T20:25:57.1683743Z BUILD_VARIANT: cuda
2025-05-07T20:25:57.1683984Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:57.1684243Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:57.1684552Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:57.1684896Z ##[endgroup]
2025-05-07T20:25:57.5009842Z ################################################################################
2025-05-07T20:25:57.5010215Z # Install CUDA
2025-05-07T20:25:57.5010423Z #
2025-05-07T20:25:57.5024266Z # [2025-05-07T20:25:57.502Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:57.5024651Z ################################################################################
2025-05-07T20:25:57.5025190Z
2025-05-07T20:25:57.5039638Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:57.5950996Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:57.5951476Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:57.5956663Z + conda clean --packages --tarball -y
2025-05-07T20:25:57.5956891Z
2025-05-07T20:25:58.4647438Z Will remove 40 (182.7 MB) tarball(s).
2025-05-07T20:25:58.4648085Z Will remove 7 (108.6 MB) package(s).
2025-05-07T20:25:58.5268953Z
2025-05-07T20:25:58.5278262Z + conda clean --all -y
2025-05-07T20:25:58.5278420Z
2025-05-07T20:25:59.1968708Z There are no unused tarball(s) to remove.
2025-05-07T20:25:59.1969042Z Will remove 1 index cache(s).
2025-05-07T20:25:59.1969326Z There are no unused package(s) to remove.
2025-05-07T20:25:59.1969639Z There are no tempfile(s) to remove.
2025-05-07T20:25:59.1969950Z There are no logfile(s) to remove.
2025-05-07T20:25:59.2588766Z
2025-05-07T20:25:59.2604054Z [INSTALL] Installing CUDA 12.8.0 ...
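The compiler probes above and the install step that follows reduce to a handful of reproducible shell commands. Below is a minimal sketch for running the same checks by hand; it assumes a Conda environment named build_binary like the one this job uses, and the expected values in the comments (C17 / C++17 for this conda-forge GCC 11.4.0) are read off the log output above rather than enforced by the script.

    # Sketch: reproduce the toolchain sanity checks from this job by hand.
    # Assumes a Conda env named build_binary with conda-forge gcc/gxx installed.
    env_name=build_binary

    # Report the compiler version.
    conda run -n "$env_name" c++ --version

    # Dump every predefined preprocessor macro (C, then C++); an empty stdin
    # is enough because -E only needs something to preprocess.
    conda run -n "$env_name" cc -dM -E - < /dev/null
    conda run -n "$env_name" c++ -dM -E -x c++ - < /dev/null

    # Default language standards: 201710L is C17, 201703L is C++17.
    conda run -n "$env_name" cc -dM -E - < /dev/null | grep __STDC_VERSION__
    conda run -n "$env_name" c++ -dM -E -x c++ - < /dev/null | grep __cplusplus

    # The CUDA step that follows is, at its core, a single pinned install
    # from conda-forge (the job's retry wrapper is elided here):
    conda install --force-reinstall -n "$env_name" -c conda-forge --override-channels -y cuda=12.8.0

As the [EXEC] [ATTEMPT 0/3] markers in the log suggest, the actual job wraps the install in a retry helper (apparently up to three attempts) and cleans the Conda package caches first, so a transient network or channel hiccup need not fail the build outright.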
2025-05-07T20:25:59.2627934Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:26:00.1707804Z Channels: 2025-05-07T20:26:00.1708054Z - conda-forge 2025-05-07T20:26:00.1708286Z Platform: linux-64 2025-05-07T20:26:10.6958248Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:11.8027190Z Solving environment: / - \ | / done 2025-05-07T20:26:11.8787952Z 2025-05-07T20:26:11.8788313Z ## Package Plan ## 2025-05-07T20:26:11.8788539Z 2025-05-07T20:26:11.8788819Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:11.8789228Z 2025-05-07T20:26:11.8789371Z added / updated specs: 2025-05-07T20:26:11.8789635Z - cuda=12.8.0 2025-05-07T20:26:11.8789767Z 2025-05-07T20:26:11.8789797Z 2025-05-07T20:26:11.8789927Z The following packages will be downloaded: 2025-05-07T20:26:11.8790140Z 2025-05-07T20:26:11.8790321Z package | build 2025-05-07T20:26:11.8790763Z ---------------------------|----------------- 2025-05-07T20:26:11.8791289Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:11.8791904Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:11.8792328Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:11.8792772Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:11.8793349Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:11.8793839Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:26:11.8795701Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8796240Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:11.8796719Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:26:11.8797256Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:26:11.8797701Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:11.8798157Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:26:11.8798648Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:26:11.8799146Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:11.8799661Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:26:11.8800173Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:26:11.8800660Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:26:11.8801107Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:26:11.8801721Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:26:11.8802181Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:26:11.8802635Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:11.8803123Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:26:11.8803585Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:26:11.8804024Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8804488Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8804966Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:11.8805406Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:11.8805880Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:26:11.8806350Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:26:11.8806812Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 
MB conda-forge 2025-05-07T20:26:11.8807280Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:11.8807734Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:26:11.8808198Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:26:11.8808646Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:26:11.8809099Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:26:11.8809545Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:26:11.8809994Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:26:11.8810443Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:11.8810897Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:26:11.8811391Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:26:11.8811891Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:26:11.8812337Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:26:11.8812766Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:11.8813222Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:26:11.8813839Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:11.8814305Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:11.8814778Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:26:11.8815259Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:11.8815702Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:11.8816132Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:26:11.8816594Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:11.8817066Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:11.8817488Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:11.8817948Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:11.8818474Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:11.8818994Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:11.8819564Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:11.8820010Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:11.8820473Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:11.8820950Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:11.8821386Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:11.8821836Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:11.8822244Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:26:11.8822641Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:11.8823027Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:11.8823427Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:11.8823832Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:11.8824219Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:11.8824640Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:26:11.8825096Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:26:11.8825539Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:26:11.8825980Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:11.8826423Z libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge 2025-05-07T20:26:11.8826876Z libcufile-dev-1.13.0.11 | h5888daf_0 
35 KB conda-forge 2025-05-07T20:26:11.8827319Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:26:11.8827777Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:26:11.8828232Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:26:11.8828697Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:26:11.8829165Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:26:11.8829633Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:11.8830103Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:26:11.8830548Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:11.8830995Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:11.8831554Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:11.8831995Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:11.8832411Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:26:11.8832845Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:11.8833275Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:11.8833676Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:11.8834088Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:26:11.8834521Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:26:11.8834952Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:11.8835383Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:26:11.8835957Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:11.8836514Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:26:11.8836980Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:11.8837442Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:26:11.8837891Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:11.8838338Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:26:11.8838750Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:11.8839168Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:11.8839606Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:11.8840049Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:11.8840456Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:11.8840890Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:11.8841333Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:11.8841747Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:11.8842156Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:11.8842573Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:11.8843016Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:26:11.8843459Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:11.8843847Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:11.8844249Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:11.8844701Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:11.8845139Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:26:11.8845573Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:11.8846012Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:11.8846426Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:26:11.8846823Z tk-8.6.13 
|noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:11.8847229Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:11.8847646Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:11.8856854Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:11.8857429Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:11.8857969Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:11.8858509Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:11.8858982Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:11.8859439Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:11.8859900Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:11.8860330Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:11.8860764Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:11.8861204Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:11.8861674Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:11.8862165Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:11.8862757Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:11.8863216Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:11.8863665Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:11.8864110Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:11.8864558Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:11.8865022Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:11.8865855Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:11.8866362Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:11.8866913Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:11.8867299Z ------------------------------------------------------------ 2025-05-07T20:26:11.8867650Z Total: 1.88 GB 2025-05-07T20:26:11.8867861Z 2025-05-07T20:26:11.8868001Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:11.8868224Z 2025-05-07T20:26:11.8868435Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:11.8868859Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:11.8869282Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:11.8869744Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:11.8870176Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:26:11.8870657Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:26:11.8871254Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:26:11.8871998Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:26:11.8872539Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:11.8873100Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:26:11.8873621Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 2025-05-07T20:26:11.8874146Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:26:11.8874719Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:11.8875324Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:26:11.8876249Z cuda-cudart-stati~ 
conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:26:11.8876864Z   cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:26:11.8877424Z   cuda-cuobjdump     conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8877944Z   cuda-cupti         conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0
2025-05-07T20:26:11.8878455Z   cuda-cupti-dev     conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0
2025-05-07T20:26:11.8879039Z   cuda-cuxxfilt      conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8879795Z   cuda-driver-dev    conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1
2025-05-07T20:26:11.8880497Z   cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1
2025-05-07T20:26:11.8881036Z   cuda-gdb           conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0
2025-05-07T20:26:11.8881540Z   cuda-libraries     conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0
2025-05-07T20:26:11.8882115Z   cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0
2025-05-07T20:26:11.8882672Z   cuda-nsight        conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0
2025-05-07T20:26:11.8883340Z   cuda-nvcc          conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0
2025-05-07T20:26:11.8883983Z   cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1
2025-05-07T20:26:11.8884609Z   cuda-nvcc-impl     conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1
2025-05-07T20:26:11.8885158Z   cuda-nvcc-tools    conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1
2025-05-07T20:26:11.8885717Z   cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0
2025-05-07T20:26:11.8886262Z   cuda-nvdisasm      conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8886780Z   cuda-nvml-dev      conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8887297Z   cuda-nvprof        conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0
2025-05-07T20:26:11.8887805Z   cuda-nvprune       conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8888299Z   cuda-nvrtc         conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0
2025-05-07T20:26:11.8888823Z   cuda-nvrtc-dev     conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0
2025-05-07T20:26:11.8889324Z   cuda-nvtx          conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8889846Z   cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1
2025-05-07T20:26:11.8890406Z   cuda-nvvm-impl     conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1
2025-05-07T20:26:11.8890956Z   cuda-nvvm-tools    conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1
2025-05-07T20:26:11.8891475Z   cuda-nvvp          conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0
2025-05-07T20:26:11.8892061Z   cuda-opencl        conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8892654Z   cuda-opencl-dev    conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0
2025-05-07T20:26:11.8893233Z   cuda-profiler-api  conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0
2025-05-07T20:26:11.8893786Z   cuda-runtime       conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0
2025-05-07T20:26:11.8894344Z   cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8894893Z   cuda-toolkit       conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0
2025-05-07T20:26:11.8895377Z   cuda-tools         conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0
2025-05-07T20:26:11.8895858Z   cuda-version       conda-forge/noarch::cuda-version-12.8-h5d125a7_3
2025-05-07T20:26:11.8896394Z   cuda-visual-tools  conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0
2025-05-07T20:26:11.8896935Z   cxx-compiler       conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
2025-05-07T20:26:11.8897392Z   dbus               conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:26:11.8898035Z   font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:26:11.8898660Z   font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:26:11.8899263Z   font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:26:11.8899842Z   font-ttf-ubuntu    conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:26:11.8900345Z   fontconfig         conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:26:11.8900839Z   fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:26:11.8901384Z   fonts-conda-forge  conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:26:11.8901869Z   freetype           conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:26:11.8902300Z   gcc                conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:26:11.8902732Z   gds-tools          conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0
2025-05-07T20:26:11.8903214Z   gmp                conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:26:11.8903642Z   gxx                conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:26:11.8904159Z   keyutils           conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:26:11.8904586Z   krb5               conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:26:11.8904998Z   libcap             conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:26:11.8905455Z   libcublas          conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0
2025-05-07T20:26:11.8905978Z   libcublas-dev      conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0
2025-05-07T20:26:11.8906559Z   libcufft           conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0
2025-05-07T20:26:11.8907128Z   libcufft-dev       conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0
2025-05-07T20:26:11.8907714Z   libcufile          conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0
2025-05-07T20:26:11.8908295Z   libcufile-dev      conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0
2025-05-07T20:26:11.8908897Z   libcurand          conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0
2025-05-07T20:26:11.8909417Z   libcurand-dev      conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0
2025-05-07T20:26:11.8909947Z   libcusolver        conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0
2025-05-07T20:26:11.8910487Z   libcusolver-dev    conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0
2025-05-07T20:26:11.8911034Z   libcusparse        conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0
2025-05-07T20:26:11.8911624Z   libcusparse-dev    conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0
2025-05-07T20:26:11.8912151Z   libedit            conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:26:11.8912634Z   libfreetype        conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:26:11.8913145Z   libfreetype6       conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:26:11.8913668Z   libgcrypt-lib      conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:26:11.8914155Z   libglib            conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:26:11.8914602Z   libglvnd           conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2
2025-05-07T20:26:11.8915076Z   libgpg-error       conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:26:11.8915653Z   libiconv           conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:26:11.8916090Z   libnl              conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:26:11.8916519Z   libnpp             conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0
2025-05-07T20:26:11.8916994Z   libnpp-dev         conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0
2025-05-07T20:26:11.8917468Z   libnuma            conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:26:11.8917935Z   libnvfatbin        conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0
2025-05-07T20:26:11.8918589Z   libnvfatbin-dev    conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0
2025-05-07T20:26:11.8919137Z   libnvjitlink       conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0
2025-05-07T20:26:11.8919691Z   libnvjitlink-dev   conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0
2025-05-07T20:26:11.8920221Z   libnvjpeg          conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0
2025-05-07T20:26:11.8920738Z   libnvjpeg-dev      conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0
2025-05-07T20:26:11.8921296Z   libopengl          conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2
2025-05-07T20:26:11.8921747Z   libpng             conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:26:11.8922199Z   libsystemd0        conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:26:11.8922669Z   libudev1           conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:26:11.8923107Z   libxcb             conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:26:11.8923584Z   libxkbcommon       conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:26:11.8924073Z   libxkbfile         conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:26:11.8924638Z   libxml2            conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:26:11.8925067Z   lz4-c              conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:26:11.8925565Z   nsight-compute     conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0
2025-05-07T20:26:11.8926055Z   nspr               conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:26:11.8926446Z   nss                conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:26:11.8926861Z   ocl-icd            conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:26:11.8927359Z   opencl-headers     conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:26:11.8927860Z   pcre2              conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:26:11.8928345Z   pthread-stubs      conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:26:11.8928849Z   rdma-core          conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:26:11.8929300Z   wayland            conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:26:11.8929744Z   xcb-util           conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:26:11.8930242Z   xcb-util-cursor    conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:26:11.8930786Z   xcb-util-image     conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:26:11.8931351Z   xcb-util-keysyms   conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:26:11.8931969Z   xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:26:11.8932509Z   xcb-util-wm        conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:26:11.8933029Z   xkeyboard-config   conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:26:11.8933560Z   xorg-libice        conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:26:11.8934038Z   xorg-libsm         conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:26:11.8934628Z   xorg-libx11        conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:26:11.8935117Z   xorg-libxau        conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:26:11.8935659Z   xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:26:11.8936236Z   xorg-libxdamage    conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:26:11.8936771Z   xorg-libxdmcp      conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:26:11.8937278Z   xorg-libxext       conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:26:11.8937793Z   xorg-libxfixes     conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:26:11.8938298Z   xorg-libxi         conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:26:11.8938940Z   xorg-libxrandr     conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:26:11.8939487Z   xorg-libxrender    conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:26:11.8940024Z   xorg-libxtst       conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:26:11.8940478Z   zstd               conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:26:11.8940726Z 
2025-05-07T20:26:11.8940848Z The following packages will be UPDATED:
2025-05-07T20:26:11.8941055Z 
2025-05-07T20:26:11.8941220Z   libsqlite          3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
2025-05-07T20:26:11.8941760Z   libzlib            1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:11.8942220Z   zlib               1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:11.8942462Z 
2025-05-07T20:26:11.8942688Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:26:11.8943002Z 
2025-05-07T20:26:11.8943271Z   sqlite             pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:26:11.8943960Z   tk                 pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:26:11.8944291Z 
2025-05-07T20:26:11.8944478Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:11.8944863Z libcublas-12.8.3.14  | 460.2 MB |   |  0%
2025-05-07T20:26:11.8945516Z nsight-compute-2025. | 320.6 MB |   |  0%
2025-05-07T20:26:11.8945995Z libcusparse-12.5.7.5 | 164.9 MB |   |  0%
2025-05-07T20:26:11.8946494Z libcusolver-11.7.2.5 | 156.9 MB |   |  0%
2025-05-07T20:26:11.8947010Z libcufft-11.3.3.41   | 147.4 MB |   |  0%
2025-05-07T20:26:11.8947527Z libnpp-12.3.3.65     | 130.6 MB |   |  0%
2025-05-07T20:26:11.8948063Z cuda-nsight-12.8.55  | 113.2 MB |   |  0%
2025-05-07T20:26:11.8964039Z cuda-nvvp-12.8.57    | 112.4 MB |   |  0%
2025-05-07T20:26:11.8967921Z cuda-nvrtc-12.8.61   | 63.1 MB  |   |  0%
2025-05-07T20:26:11.8969233Z libcurand-10.3.9.55  | 43.6 MB  |   |  0%
2025-05-07T20:26:11.8970710Z gds-tools-1.13.0.11  | 37.9 MB  |   |  0%
2025-05-07T20:26:11.8971819Z libnvjitlink-12.8.61 | 28.7 MB  |   |  0%
2025-05-07T20:26:11.8972737Z cuda-nvcc-tools-12.8 | 24.5 MB  |   |  0%
2025-05-07T20:26:11.8974022Z cuda-nvvm-tools-12.8 | 23.5 MB  |   |  0%
2025-05-07T20:26:11.8975168Z cuda-nvvm-impl-12.8. | 20.8 MB  |   |  0%
2025-05-07T20:26:11.8976413Z cuda-nvcc-dev_linux- | 12.7 MB  |   |  0%
2025-05-07T20:26:11.8978162Z cuda-sanitizer-api-1 | 8.8 MB   |   |  0%
2025-05-07T20:26:11.8980459Z cuda-nvdisasm-12.8.5 | 4.9 MB   |   |  0%
2025-05-07T20:26:11.8981533Z cuda-cupti-dev-12.8. | 4.0 MB   |   |  0%
2025-05-07T20:26:11.9886802Z ... (more hidden) ...
[interleaved progress-bar redraw output elided: between 2025-05-07T20:26:11.99 and 20:26:20.66, libcufft-11.3.3.41, libcusolver-11.7.2.5, libcusparse-12.5.7.5, and nsight-compute-2025. each reach 100%; the log section is cut off mid-download at 20:26:21.31 with libcublas-12.8.3.14 at 73%, libnpp-12.3.3.65 at 48%, cuda-nvvp-12.8.57 at 35%, and cuda-nsight-12.8.55 at 33%]
cuda-nsight-12.8.55 | 113.2 MB | ###4 | 35%  2025-05-07T20:26:21.3195463Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:26:21.3195837Z 2025-05-07T20:26:21.3195843Z 2025-05-07T20:26:21.3195849Z 2025-05-07T20:26:21.3195854Z 2025-05-07T20:26:21.3195860Z 2025-05-07T20:26:21.3195865Z 2025-05-07T20:26:21.3195880Z 2025-05-07T20:26:21.3344233Z cuda-nvvp-12.8.57 | 112.4 MB | ###7 | 37%  2025-05-07T20:26:21.3344650Z 2025-05-07T20:26:21.3344656Z 2025-05-07T20:26:21.3344661Z 2025-05-07T20:26:21.3344682Z 2025-05-07T20:26:21.3346039Z 2025-05-07T20:26:21.4084428Z libnpp-12.3.3.65 | 130.6 MB | ####9 | 50%  2025-05-07T20:26:21.4084779Z 2025-05-07T20:26:21.4084785Z 2025-05-07T20:26:21.4084790Z 2025-05-07T20:26:21.4084795Z 2025-05-07T20:26:21.4084800Z 2025-05-07T20:26:21.4084806Z 2025-05-07T20:26:21.4116066Z cuda-nsight-12.8.55 | 113.2 MB | ###6 | 37%  2025-05-07T20:26:21.4197483Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:21.4197805Z 2025-05-07T20:26:21.4197810Z 2025-05-07T20:26:21.4197813Z 2025-05-07T20:26:21.4197817Z 2025-05-07T20:26:21.4197821Z 2025-05-07T20:26:21.4197825Z 2025-05-07T20:26:21.4199914Z 2025-05-07T20:26:21.4409662Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 39%  2025-05-07T20:26:21.4410065Z 2025-05-07T20:26:21.4410070Z 2025-05-07T20:26:21.4410094Z 2025-05-07T20:26:21.4410098Z 2025-05-07T20:26:21.4414493Z 2025-05-07T20:26:21.5094291Z libnpp-12.3.3.65 | 130.6 MB | #####1 | 52%  2025-05-07T20:26:21.5094868Z 2025-05-07T20:26:21.5094872Z 2025-05-07T20:26:21.5094876Z 2025-05-07T20:26:21.5094880Z 2025-05-07T20:26:21.5094884Z 2025-05-07T20:26:21.5095544Z 2025-05-07T20:26:21.5170955Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 39%  2025-05-07T20:26:21.5248183Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 75% 2025-05-07T20:26:21.5248540Z 2025-05-07T20:26:21.5248547Z 2025-05-07T20:26:21.5248552Z 2025-05-07T20:26:21.5248559Z 2025-05-07T20:26:21.5248565Z 2025-05-07T20:26:21.5248570Z 2025-05-07T20:26:21.5251834Z 2025-05-07T20:26:21.5412937Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 42%  2025-05-07T20:26:21.5413229Z 2025-05-07T20:26:21.5413233Z 2025-05-07T20:26:21.5413237Z 2025-05-07T20:26:21.5413240Z 2025-05-07T20:26:21.5415439Z 2025-05-07T20:26:21.6095248Z libnpp-12.3.3.65 | 130.6 MB | #####3 | 54%  2025-05-07T20:26:21.6095549Z 2025-05-07T20:26:21.6095567Z 2025-05-07T20:26:21.6095571Z 2025-05-07T20:26:21.6095574Z 2025-05-07T20:26:21.6095578Z 2025-05-07T20:26:21.6095581Z 2025-05-07T20:26:21.6214748Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 42%  2025-05-07T20:26:21.6383712Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:26:21.6384067Z 2025-05-07T20:26:21.6384074Z 2025-05-07T20:26:21.6384079Z 2025-05-07T20:26:21.6384084Z 2025-05-07T20:26:21.6384089Z 2025-05-07T20:26:21.6384094Z 2025-05-07T20:26:21.6384099Z 2025-05-07T20:26:21.6434190Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 44%  2025-05-07T20:26:21.6434540Z 2025-05-07T20:26:21.6434545Z 2025-05-07T20:26:21.6434549Z 2025-05-07T20:26:21.6434552Z 2025-05-07T20:26:21.6434556Z 2025-05-07T20:26:21.7099632Z libnpp-12.3.3.65 | 130.6 MB | #####5 | 56%  2025-05-07T20:26:21.7099959Z 2025-05-07T20:26:21.7099964Z 2025-05-07T20:26:21.7099967Z 2025-05-07T20:26:21.7099971Z 2025-05-07T20:26:21.7099975Z 2025-05-07T20:26:21.7101602Z 2025-05-07T20:26:21.7252226Z cuda-nsight-12.8.55 | 113.2 MB | ####3 | 44%  2025-05-07T20:26:21.7411271Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 76% 2025-05-07T20:26:21.7411526Z 2025-05-07T20:26:21.7411530Z 2025-05-07T20:26:21.7411534Z 2025-05-07T20:26:21.7411538Z 2025-05-07T20:26:21.7411542Z 
2025-05-07T20:26:21.7411545Z 2025-05-07T20:26:21.7411549Z 2025-05-07T20:26:21.7437058Z cuda-nvvp-12.8.57 | 112.4 MB | ####6 | 46%  2025-05-07T20:26:21.7437346Z 2025-05-07T20:26:21.7437351Z 2025-05-07T20:26:21.7437354Z 2025-05-07T20:26:21.7437358Z 2025-05-07T20:26:21.7437362Z 2025-05-07T20:26:21.8122306Z libnpp-12.3.3.65 | 130.6 MB | #####7 | 58%  2025-05-07T20:26:21.8122610Z 2025-05-07T20:26:21.8122615Z 2025-05-07T20:26:21.8122619Z 2025-05-07T20:26:21.8122862Z 2025-05-07T20:26:21.8122868Z 2025-05-07T20:26:21.8123644Z 2025-05-07T20:26:21.8350949Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 46%  2025-05-07T20:26:21.8422633Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:21.8422890Z 2025-05-07T20:26:21.8422894Z 2025-05-07T20:26:21.8422898Z 2025-05-07T20:26:21.8422902Z 2025-05-07T20:26:21.8422905Z 2025-05-07T20:26:21.8422909Z 2025-05-07T20:26:21.8430005Z 2025-05-07T20:26:21.8538162Z cuda-nvvp-12.8.57 | 112.4 MB | ####8 | 48%  2025-05-07T20:26:21.8538462Z 2025-05-07T20:26:21.8538472Z 2025-05-07T20:26:21.8538478Z 2025-05-07T20:26:21.8538483Z 2025-05-07T20:26:21.8540106Z 2025-05-07T20:26:21.9123780Z libnpp-12.3.3.65 | 130.6 MB | #####9 | 60%  2025-05-07T20:26:21.9124190Z 2025-05-07T20:26:21.9124196Z 2025-05-07T20:26:21.9124201Z 2025-05-07T20:26:21.9124206Z 2025-05-07T20:26:21.9124212Z 2025-05-07T20:26:21.9127186Z 2025-05-07T20:26:21.9436348Z cuda-nsight-12.8.55 | 113.2 MB | ####8 | 49%  2025-05-07T20:26:21.9543275Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:21.9543830Z 2025-05-07T20:26:21.9543834Z 2025-05-07T20:26:21.9543838Z 2025-05-07T20:26:21.9543842Z 2025-05-07T20:26:21.9543845Z 2025-05-07T20:26:21.9546139Z libnpp-12.3.3.65 | 130.6 MB | ######1 | 62%  2025-05-07T20:26:21.9546439Z 2025-05-07T20:26:21.9546443Z 2025-05-07T20:26:21.9546446Z 2025-05-07T20:26:21.9546450Z 2025-05-07T20:26:21.9546454Z 2025-05-07T20:26:21.9546457Z 2025-05-07T20:26:21.9553725Z 2025-05-07T20:26:22.0171174Z cuda-nvvp-12.8.57 | 112.4 MB | ##### | 51%  2025-05-07T20:26:22.0171477Z 2025-05-07T20:26:22.0171482Z 2025-05-07T20:26:22.0171485Z 2025-05-07T20:26:22.0171489Z 2025-05-07T20:26:22.0171501Z 2025-05-07T20:26:22.0171505Z 2025-05-07T20:26:22.0438019Z cuda-nsight-12.8.55 | 113.2 MB | ##### | 51%  2025-05-07T20:26:22.0544611Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:22.0544996Z 2025-05-07T20:26:22.0545002Z 2025-05-07T20:26:22.0545024Z 2025-05-07T20:26:22.0545029Z 2025-05-07T20:26:22.0547073Z 2025-05-07T20:26:22.0555561Z libnpp-12.3.3.65 | 130.6 MB | ######3 | 64%  2025-05-07T20:26:22.0555835Z 2025-05-07T20:26:22.0555839Z 2025-05-07T20:26:22.0555843Z 2025-05-07T20:26:22.0555846Z 2025-05-07T20:26:22.0555850Z 2025-05-07T20:26:22.0555854Z 2025-05-07T20:26:22.0560823Z 2025-05-07T20:26:22.1171965Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 53%  2025-05-07T20:26:22.1172294Z 2025-05-07T20:26:22.1172299Z 2025-05-07T20:26:22.1172305Z 2025-05-07T20:26:22.1172309Z 2025-05-07T20:26:22.1172314Z 2025-05-07T20:26:22.1172595Z 2025-05-07T20:26:22.1443274Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 53%  2025-05-07T20:26:22.1556591Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 78% 2025-05-07T20:26:22.1556949Z 2025-05-07T20:26:22.1556974Z 2025-05-07T20:26:22.1556978Z 2025-05-07T20:26:22.1556982Z 2025-05-07T20:26:22.1556986Z 2025-05-07T20:26:22.1557002Z 2025-05-07T20:26:22.1557005Z 2025-05-07T20:26:22.1569013Z cuda-nvvp-12.8.57 | 112.4 MB | #####5 | 55%  2025-05-07T20:26:22.1569312Z 2025-05-07T20:26:22.1569316Z 2025-05-07T20:26:22.1569320Z 2025-05-07T20:26:22.1569323Z 
2025-05-07T20:26:22.1571540Z 2025-05-07T20:26:22.2304605Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 66%  2025-05-07T20:26:22.2304903Z 2025-05-07T20:26:22.2304907Z 2025-05-07T20:26:22.2304911Z 2025-05-07T20:26:22.2304914Z 2025-05-07T20:26:22.2304925Z 2025-05-07T20:26:22.2306499Z 2025-05-07T20:26:22.2450765Z cuda-nsight-12.8.55 | 113.2 MB | #####5 | 56%  2025-05-07T20:26:22.2575372Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:26:22.2575731Z 2025-05-07T20:26:22.2575736Z 2025-05-07T20:26:22.2575741Z 2025-05-07T20:26:22.2575746Z 2025-05-07T20:26:22.2576047Z 2025-05-07T20:26:22.2576061Z 2025-05-07T20:26:22.2576066Z 2025-05-07T20:26:22.2576872Z cuda-nvvp-12.8.57 | 112.4 MB | #####7 | 58%  2025-05-07T20:26:22.2577169Z 2025-05-07T20:26:22.2577173Z 2025-05-07T20:26:22.2577177Z 2025-05-07T20:26:22.2577188Z 2025-05-07T20:26:22.2577438Z 2025-05-07T20:26:22.3307193Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 68%  2025-05-07T20:26:22.3307487Z 2025-05-07T20:26:22.3307491Z 2025-05-07T20:26:22.3307507Z 2025-05-07T20:26:22.3307511Z 2025-05-07T20:26:22.3307515Z 2025-05-07T20:26:22.3309520Z 2025-05-07T20:26:22.3452999Z cuda-nsight-12.8.55 | 113.2 MB | #####7 | 58%  2025-05-07T20:26:22.3581108Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 80% 2025-05-07T20:26:22.3581453Z 2025-05-07T20:26:22.3581458Z 2025-05-07T20:26:22.3581464Z 2025-05-07T20:26:22.3581470Z 2025-05-07T20:26:22.3581474Z 2025-05-07T20:26:22.3593855Z libnpp-12.3.3.65 | 130.6 MB | ####### | 70%  2025-05-07T20:26:22.3594127Z 2025-05-07T20:26:22.3594131Z 2025-05-07T20:26:22.3594134Z 2025-05-07T20:26:22.3594138Z 2025-05-07T20:26:22.3594458Z 2025-05-07T20:26:22.3594463Z 2025-05-07T20:26:22.3599498Z 2025-05-07T20:26:22.4416445Z cuda-nvvp-12.8.57 | 112.4 MB | ###### | 60%  2025-05-07T20:26:22.4416746Z 2025-05-07T20:26:22.4416750Z 2025-05-07T20:26:22.4416765Z 2025-05-07T20:26:22.4416769Z 2025-05-07T20:26:22.4416772Z 2025-05-07T20:26:22.4416776Z 2025-05-07T20:26:22.4456165Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 60%  2025-05-07T20:26:22.4581651Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:26:22.4581993Z 2025-05-07T20:26:22.4582061Z 2025-05-07T20:26:22.4582066Z 2025-05-07T20:26:22.4582073Z 2025-05-07T20:26:22.4582094Z 2025-05-07T20:26:22.4596731Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 72%  2025-05-07T20:26:22.4597168Z 2025-05-07T20:26:22.4597176Z 2025-05-07T20:26:22.4597208Z 2025-05-07T20:26:22.4597214Z 2025-05-07T20:26:22.4597220Z 2025-05-07T20:26:22.4597224Z 2025-05-07T20:26:22.4597229Z 2025-05-07T20:26:22.5426818Z cuda-nvvp-12.8.57 | 112.4 MB | ######2 | 63%  2025-05-07T20:26:22.5427117Z 2025-05-07T20:26:22.5427121Z 2025-05-07T20:26:22.5427124Z 2025-05-07T20:26:22.5427128Z 2025-05-07T20:26:22.5427131Z 2025-05-07T20:26:22.5427135Z 2025-05-07T20:26:22.5456556Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 62%  2025-05-07T20:26:22.5583769Z libcublas-12.8.3.14 | 460.2 MB | ######## | 81% 2025-05-07T20:26:22.5584026Z 2025-05-07T20:26:22.5584030Z 2025-05-07T20:26:22.5584034Z 2025-05-07T20:26:22.5584038Z 2025-05-07T20:26:22.5584041Z 2025-05-07T20:26:22.5598453Z libnpp-12.3.3.65 | 130.6 MB | #######4 | 74%  2025-05-07T20:26:22.5598730Z 2025-05-07T20:26:22.5598734Z 2025-05-07T20:26:22.5598737Z 2025-05-07T20:26:22.5598741Z 2025-05-07T20:26:22.5598744Z 2025-05-07T20:26:22.5598749Z 2025-05-07T20:26:22.5602139Z 2025-05-07T20:26:22.6430659Z cuda-nvvp-12.8.57 | 112.4 MB | ######5 | 65%  2025-05-07T20:26:22.6430986Z 2025-05-07T20:26:22.6430990Z 2025-05-07T20:26:22.6430994Z 2025-05-07T20:26:22.6430998Z 2025-05-07T20:26:22.6431001Z 
2025-05-07T20:26:22.6433412Z 2025-05-07T20:26:22.6599359Z cuda-nsight-12.8.55 | 113.2 MB | ######4 | 64%  2025-05-07T20:26:22.6657803Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:26:22.6658064Z 2025-05-07T20:26:22.6658068Z 2025-05-07T20:26:22.6658071Z 2025-05-07T20:26:22.6658075Z 2025-05-07T20:26:22.6658078Z 2025-05-07T20:26:22.6658082Z 2025-05-07T20:26:22.6663472Z 2025-05-07T20:26:22.6669060Z cuda-nvvp-12.8.57 | 112.4 MB | ######7 | 68%  2025-05-07T20:26:22.6669351Z 2025-05-07T20:26:22.6669355Z 2025-05-07T20:26:22.6669359Z 2025-05-07T20:26:22.6669362Z 2025-05-07T20:26:22.6672948Z 2025-05-07T20:26:22.7451981Z libnpp-12.3.3.65 | 130.6 MB | #######6 | 77%  2025-05-07T20:26:22.7452279Z 2025-05-07T20:26:22.7452282Z 2025-05-07T20:26:22.7452286Z 2025-05-07T20:26:22.7452300Z 2025-05-07T20:26:22.7452304Z 2025-05-07T20:26:22.7456687Z 2025-05-07T20:26:22.7687428Z cuda-nsight-12.8.55 | 113.2 MB | ######6 | 67%  2025-05-07T20:26:22.7687844Z 2025-05-07T20:26:22.7687850Z 2025-05-07T20:26:22.7687856Z 2025-05-07T20:26:22.7687861Z 2025-05-07T20:26:22.7689816Z 2025-05-07T20:26:22.7791192Z libnpp-12.3.3.65 | 130.6 MB | #######8 | 79%  2025-05-07T20:26:22.7807027Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 82% 2025-05-07T20:26:22.7807394Z 2025-05-07T20:26:22.7807400Z 2025-05-07T20:26:22.7807406Z 2025-05-07T20:26:22.7807410Z 2025-05-07T20:26:22.7807415Z 2025-05-07T20:26:22.7807420Z 2025-05-07T20:26:22.7807426Z 2025-05-07T20:26:22.8528097Z cuda-nvvp-12.8.57 | 112.4 MB | ####### | 70%  2025-05-07T20:26:22.8528419Z 2025-05-07T20:26:22.8528423Z 2025-05-07T20:26:22.8528449Z 2025-05-07T20:26:22.8528453Z 2025-05-07T20:26:22.8528456Z 2025-05-07T20:26:22.8530378Z 2025-05-07T20:26:22.8692358Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 69%  2025-05-07T20:26:22.8692654Z 2025-05-07T20:26:22.8692658Z 2025-05-07T20:26:22.8692662Z 2025-05-07T20:26:22.8692665Z 2025-05-07T20:26:22.8692669Z 2025-05-07T20:26:22.8794997Z libnpp-12.3.3.65 | 130.6 MB | ######## | 81%  2025-05-07T20:26:22.8902097Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:26:22.8902390Z 2025-05-07T20:26:22.8902394Z 2025-05-07T20:26:22.8902397Z 2025-05-07T20:26:22.8902401Z 2025-05-07T20:26:22.8902404Z 2025-05-07T20:26:22.8902408Z 2025-05-07T20:26:22.8902412Z 2025-05-07T20:26:22.9528912Z cuda-nvvp-12.8.57 | 112.4 MB | #######2 | 72%  2025-05-07T20:26:22.9529230Z 2025-05-07T20:26:22.9529234Z 2025-05-07T20:26:22.9529237Z 2025-05-07T20:26:22.9529241Z 2025-05-07T20:26:22.9529245Z 2025-05-07T20:26:22.9530625Z 2025-05-07T20:26:22.9736194Z cuda-nsight-12.8.55 | 113.2 MB | #######1 | 71%  2025-05-07T20:26:22.9736506Z 2025-05-07T20:26:22.9736510Z 2025-05-07T20:26:22.9736513Z 2025-05-07T20:26:22.9736517Z 2025-05-07T20:26:22.9742101Z 2025-05-07T20:26:22.9886092Z libnpp-12.3.3.65 | 130.6 MB | ########2 | 83%  2025-05-07T20:26:22.9978843Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 83% 2025-05-07T20:26:22.9979234Z 2025-05-07T20:26:22.9979241Z 2025-05-07T20:26:22.9979246Z 2025-05-07T20:26:22.9979251Z 2025-05-07T20:26:22.9979257Z 2025-05-07T20:26:22.9979262Z 2025-05-07T20:26:22.9979436Z 2025-05-07T20:26:23.0534005Z cuda-nvvp-12.8.57 | 112.4 MB | #######4 | 75%  2025-05-07T20:26:23.0534297Z 2025-05-07T20:26:23.0534301Z 2025-05-07T20:26:23.0534305Z 2025-05-07T20:26:23.0534308Z 2025-05-07T20:26:23.0534313Z 2025-05-07T20:26:23.0534521Z 2025-05-07T20:26:23.0757453Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 73%  2025-05-07T20:26:23.0757752Z 2025-05-07T20:26:23.0757756Z 2025-05-07T20:26:23.0757759Z 2025-05-07T20:26:23.0757763Z 
2025-05-07T20:26:23.0767314Z 2025-05-07T20:26:23.0920169Z libnpp-12.3.3.65 | 130.6 MB | ########4 | 85%  2025-05-07T20:26:23.0988020Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:26:23.0988297Z 2025-05-07T20:26:23.0988302Z 2025-05-07T20:26:23.0988305Z 2025-05-07T20:26:23.0988309Z 2025-05-07T20:26:23.0988313Z 2025-05-07T20:26:23.0988316Z 2025-05-07T20:26:23.0988320Z 2025-05-07T20:26:23.1542352Z cuda-nvvp-12.8.57 | 112.4 MB | #######7 | 77%  2025-05-07T20:26:23.1542658Z 2025-05-07T20:26:23.1542662Z 2025-05-07T20:26:23.1542666Z 2025-05-07T20:26:23.1542670Z 2025-05-07T20:26:23.1542674Z 2025-05-07T20:26:23.1546021Z 2025-05-07T20:26:23.1961103Z cuda-nsight-12.8.55 | 113.2 MB | #######5 | 76%  2025-05-07T20:26:23.1961412Z 2025-05-07T20:26:23.1961416Z 2025-05-07T20:26:23.1961655Z 2025-05-07T20:26:23.1961660Z 2025-05-07T20:26:23.1963535Z 2025-05-07T20:26:23.1989700Z libnpp-12.3.3.65 | 130.6 MB | ########6 | 87%  2025-05-07T20:26:23.1990000Z 2025-05-07T20:26:23.1990004Z 2025-05-07T20:26:23.1990007Z 2025-05-07T20:26:23.1990011Z 2025-05-07T20:26:23.1990014Z 2025-05-07T20:26:23.1990018Z 2025-05-07T20:26:23.1998764Z 2025-05-07T20:26:23.2001693Z cuda-nvvp-12.8.57 | 112.4 MB | #######9 | 80%  2025-05-07T20:26:23.2585281Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 84% 2025-05-07T20:26:23.2585610Z 2025-05-07T20:26:23.2585642Z 2025-05-07T20:26:23.2585648Z 2025-05-07T20:26:23.2585653Z 2025-05-07T20:26:23.2585658Z 2025-05-07T20:26:23.2585882Z 2025-05-07T20:26:23.3007278Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 78%  2025-05-07T20:26:23.3036755Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:26:23.3037137Z 2025-05-07T20:26:23.3037145Z 2025-05-07T20:26:23.3037151Z 2025-05-07T20:26:23.3037184Z 2025-05-07T20:26:23.3037190Z 2025-05-07T20:26:23.3037196Z 2025-05-07T20:26:23.3037231Z 2025-05-07T20:26:23.3068743Z cuda-nvvp-12.8.57 | 112.4 MB | ########1 | 82%  2025-05-07T20:26:23.3069135Z 2025-05-07T20:26:23.3069140Z 2025-05-07T20:26:23.3069143Z 2025-05-07T20:26:23.3069147Z 2025-05-07T20:26:23.3069367Z 2025-05-07T20:26:23.3742571Z libnpp-12.3.3.65 | 130.6 MB | ########8 | 89%  2025-05-07T20:26:23.3742858Z 2025-05-07T20:26:23.3742862Z 2025-05-07T20:26:23.3742866Z 2025-05-07T20:26:23.3742869Z 2025-05-07T20:26:23.3742873Z 2025-05-07T20:26:23.3742880Z 2025-05-07T20:26:23.4008778Z cuda-nsight-12.8.55 | 113.2 MB | ######## | 80%  2025-05-07T20:26:23.4040329Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 85% 2025-05-07T20:26:23.4040604Z 2025-05-07T20:26:23.4040608Z 2025-05-07T20:26:23.4040612Z 2025-05-07T20:26:23.4040616Z 2025-05-07T20:26:23.4040620Z 2025-05-07T20:26:23.4040624Z 2025-05-07T20:26:23.4040652Z 2025-05-07T20:26:23.4097320Z cuda-nvvp-12.8.57 | 112.4 MB | ########4 | 84%  2025-05-07T20:26:23.4097778Z 2025-05-07T20:26:23.4097784Z 2025-05-07T20:26:23.4097789Z 2025-05-07T20:26:23.4097794Z 2025-05-07T20:26:23.4102087Z 2025-05-07T20:26:23.4847649Z libnpp-12.3.3.65 | 130.6 MB | ######### | 91%  2025-05-07T20:26:23.4847990Z 2025-05-07T20:26:23.4847996Z 2025-05-07T20:26:23.4848001Z 2025-05-07T20:26:23.4848006Z 2025-05-07T20:26:23.4848011Z 2025-05-07T20:26:23.4848016Z 2025-05-07T20:26:23.5018892Z cuda-nsight-12.8.55 | 113.2 MB | ########2 | 82%  2025-05-07T20:26:23.5044920Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 86% 2025-05-07T20:26:23.5045166Z 2025-05-07T20:26:23.5045173Z 2025-05-07T20:26:23.5045520Z 2025-05-07T20:26:23.5045581Z 2025-05-07T20:26:23.5045585Z 2025-05-07T20:26:23.5045589Z 2025-05-07T20:26:23.5045734Z 2025-05-07T20:26:23.5101872Z cuda-nvvp-12.8.57 | 112.4 MB | 
########6 | 87%  2025-05-07T20:26:23.5102171Z 2025-05-07T20:26:23.5102175Z 2025-05-07T20:26:23.5102179Z 2025-05-07T20:26:23.5102193Z 2025-05-07T20:26:23.5107769Z 2025-05-07T20:26:23.5914163Z libnpp-12.3.3.65 | 130.6 MB | #########2 | 92%  2025-05-07T20:26:23.5914448Z 2025-05-07T20:26:23.5914452Z 2025-05-07T20:26:23.5914456Z 2025-05-07T20:26:23.5914459Z 2025-05-07T20:26:23.5914463Z 2025-05-07T20:26:23.5917774Z 2025-05-07T20:26:23.6020888Z cuda-nsight-12.8.55 | 113.2 MB | ########4 | 84%  2025-05-07T20:26:23.6077655Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 2025-05-07T20:26:23.6077897Z 2025-05-07T20:26:23.6077900Z 2025-05-07T20:26:23.6077904Z 2025-05-07T20:26:23.6077908Z 2025-05-07T20:26:23.6077919Z 2025-05-07T20:26:23.6077923Z 2025-05-07T20:26:23.6080367Z 2025-05-07T20:26:23.6197699Z cuda-nvvp-12.8.57 | 112.4 MB | ########9 | 89%  2025-05-07T20:26:23.6197973Z 2025-05-07T20:26:23.6197983Z 2025-05-07T20:26:23.6198214Z 2025-05-07T20:26:23.6198219Z 2025-05-07T20:26:23.6200061Z 2025-05-07T20:26:23.6920084Z libnpp-12.3.3.65 | 130.6 MB | #########4 | 94%  2025-05-07T20:26:23.6920372Z 2025-05-07T20:26:23.6920376Z 2025-05-07T20:26:23.6920380Z 2025-05-07T20:26:23.6920383Z 2025-05-07T20:26:23.6920387Z 2025-05-07T20:26:23.6920390Z 2025-05-07T20:26:23.7023199Z cuda-nsight-12.8.55 | 113.2 MB | ########6 | 86%  2025-05-07T20:26:23.7099050Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 87% 2025-05-07T20:26:23.7099293Z 2025-05-07T20:26:23.7099297Z 2025-05-07T20:26:23.7099301Z 2025-05-07T20:26:23.7099304Z 2025-05-07T20:26:23.7099315Z 2025-05-07T20:26:23.7099319Z 2025-05-07T20:26:23.7099323Z 2025-05-07T20:26:23.7282718Z cuda-nvvp-12.8.57 | 112.4 MB | #########1 | 92%  2025-05-07T20:26:23.7282991Z 2025-05-07T20:26:23.7282994Z 2025-05-07T20:26:23.7283005Z 2025-05-07T20:26:23.7283009Z 2025-05-07T20:26:23.7285491Z 2025-05-07T20:26:23.7922062Z libnpp-12.3.3.65 | 130.6 MB | #########6 | 96%  2025-05-07T20:26:23.7922330Z 2025-05-07T20:26:23.7922341Z 2025-05-07T20:26:23.7922873Z 2025-05-07T20:26:23.7922877Z 2025-05-07T20:26:23.7922881Z 2025-05-07T20:26:23.7924669Z 2025-05-07T20:26:23.8037588Z cuda-nsight-12.8.55 | 113.2 MB | ########8 | 89%  2025-05-07T20:26:23.8182824Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 87% 2025-05-07T20:26:23.8183067Z 2025-05-07T20:26:23.8183071Z 2025-05-07T20:26:23.8183074Z 2025-05-07T20:26:23.8183078Z 2025-05-07T20:26:23.8183082Z 2025-05-07T20:26:23.8183092Z 2025-05-07T20:26:23.8183096Z 2025-05-07T20:26:23.8287359Z cuda-nvvp-12.8.57 | 112.4 MB | #########3 | 94%  2025-05-07T20:26:23.8287633Z 2025-05-07T20:26:23.8287636Z 2025-05-07T20:26:23.8287640Z 2025-05-07T20:26:23.8287650Z 2025-05-07T20:26:23.8290877Z 2025-05-07T20:26:23.8926326Z libnpp-12.3.3.65 | 130.6 MB | #########7 | 98%  2025-05-07T20:26:23.8926609Z 2025-05-07T20:26:23.8926613Z 2025-05-07T20:26:23.8926630Z 2025-05-07T20:26:23.8926633Z 2025-05-07T20:26:23.8926637Z 2025-05-07T20:26:23.8926648Z 2025-05-07T20:26:23.9109308Z cuda-nsight-12.8.55 | 113.2 MB | ######### | 91%  2025-05-07T20:26:23.9193618Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:26:23.9193897Z 2025-05-07T20:26:23.9193912Z 2025-05-07T20:26:23.9193917Z 2025-05-07T20:26:23.9193922Z 2025-05-07T20:26:23.9193927Z 2025-05-07T20:26:23.9193932Z 2025-05-07T20:26:23.9193937Z 2025-05-07T20:26:23.9931608Z cuda-nvvp-12.8.57 | 112.4 MB | #########6 | 96%  2025-05-07T20:26:23.9932007Z 2025-05-07T20:26:23.9932013Z 2025-05-07T20:26:23.9932019Z 2025-05-07T20:26:23.9932024Z 2025-05-07T20:26:23.9932029Z 2025-05-07T20:26:23.9933583Z 2025-05-07T20:26:24.0114398Z 
cuda-nsight-12.8.55 | 113.2 MB | #########2 | 93%  2025-05-07T20:26:24.0193706Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:26:24.0194080Z 2025-05-07T20:26:24.0194087Z 2025-05-07T20:26:24.0194092Z 2025-05-07T20:26:24.0194097Z 2025-05-07T20:26:24.0194114Z 2025-05-07T20:26:24.0194119Z 2025-05-07T20:26:24.0194124Z 2025-05-07T20:26:24.0936272Z cuda-nvvp-12.8.57 | 112.4 MB | #########8 | 99%  2025-05-07T20:26:24.0936656Z 2025-05-07T20:26:24.0936662Z 2025-05-07T20:26:24.0936667Z 2025-05-07T20:26:24.0936672Z 2025-05-07T20:26:24.0936678Z 2025-05-07T20:26:24.0938084Z 2025-05-07T20:26:24.1119793Z cuda-nsight-12.8.55 | 113.2 MB | #########5 | 95%  2025-05-07T20:26:24.1937174Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:26:24.1937536Z 2025-05-07T20:26:24.1937542Z 2025-05-07T20:26:24.1937547Z 2025-05-07T20:26:24.1937552Z 2025-05-07T20:26:24.1937557Z 2025-05-07T20:26:24.1939432Z 2025-05-07T20:26:24.2351838Z cuda-nsight-12.8.55 | 113.2 MB | #########7 | 98%  2025-05-07T20:26:24.3710402Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 90% 2025-05-07T20:26:24.4717210Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:26:24.5718006Z libcublas-12.8.3.14 | 460.2 MB | ######### | 91% 2025-05-07T20:26:24.6726127Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:26:24.7737572Z libcublas-12.8.3.14 | 460.2 MB | #########2 | 92% 2025-05-07T20:26:24.8738874Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 93% 2025-05-07T20:26:24.9742628Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 94% 2025-05-07T20:26:25.0743635Z libcublas-12.8.3.14 | 460.2 MB | #########4 | 95% 2025-05-07T20:26:25.1744178Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 96% 2025-05-07T20:26:25.2745405Z libcublas-12.8.3.14 | 460.2 MB | #########6 | 97% 2025-05-07T20:26:25.3823529Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 98% 2025-05-07T20:26:25.4913189Z libcublas-12.8.3.14 | 460.2 MB | #########8 | 98% 2025-05-07T20:26:27.4230448Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 99% 2025-05-07T20:26:27.4230754Z 2025-05-07T20:26:27.4230761Z 2025-05-07T20:26:27.4230766Z 2025-05-07T20:26:27.4231920Z 2025-05-07T20:26:27.7831667Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:27.7832232Z 2025-05-07T20:26:27.7832236Z 2025-05-07T20:26:27.7832239Z 2025-05-07T20:26:27.7832243Z 2025-05-07T20:26:27.7832247Z 2025-05-07T20:26:27.7832251Z 2025-05-07T20:26:27.7836213Z 2025-05-07T20:26:27.8232809Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:26:27.8233208Z 2025-05-07T20:26:27.8233214Z 2025-05-07T20:26:27.8233219Z 2025-05-07T20:26:27.8233224Z 2025-05-07T20:26:27.8233229Z 2025-05-07T20:26:27.8233234Z 2025-05-07T20:26:27.8233239Z 2025-05-07T20:26:27.8233245Z 2025-05-07T20:26:27.9236210Z cuda-nvrtc-12.8.61 | 63.1 MB | | 0%  2025-05-07T20:26:27.9236598Z 2025-05-07T20:26:27.9236604Z 2025-05-07T20:26:27.9236609Z 2025-05-07T20:26:27.9236614Z 2025-05-07T20:26:27.9236620Z 2025-05-07T20:26:27.9236642Z 2025-05-07T20:26:27.9236648Z 2025-05-07T20:26:27.9236653Z 2025-05-07T20:26:28.0227938Z cuda-nvrtc-12.8.61 | 63.1 MB | 4 | 5%  2025-05-07T20:26:28.0228354Z 2025-05-07T20:26:28.0228360Z 2025-05-07T20:26:28.0228365Z 2025-05-07T20:26:28.0228371Z 2025-05-07T20:26:28.0228376Z 2025-05-07T20:26:28.0229734Z 2025-05-07T20:26:28.0239868Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:26:28.0240276Z 2025-05-07T20:26:28.0240282Z 2025-05-07T20:26:28.0240288Z 2025-05-07T20:26:28.0240300Z 2025-05-07T20:26:28.0240306Z 2025-05-07T20:26:28.0240311Z 2025-05-07T20:26:28.0240316Z 
2025-05-07T20:26:28.0240321Z 2025-05-07T20:26:28.0914156Z cuda-nvrtc-12.8.61 | 63.1 MB | 9 | 10%  2025-05-07T20:26:28.0914576Z 2025-05-07T20:26:28.0914582Z 2025-05-07T20:26:28.0914587Z 2025-05-07T20:26:28.0914592Z 2025-05-07T20:26:28.0914597Z 2025-05-07T20:26:28.0914603Z 2025-05-07T20:26:28.0914624Z 2025-05-07T20:26:28.0914629Z 2025-05-07T20:26:28.0916393Z 2025-05-07T20:26:28.1240118Z libcurand-10.3.9.55 | 43.6 MB | | 0%  2025-05-07T20:26:28.1240546Z 2025-05-07T20:26:28.1240552Z 2025-05-07T20:26:28.1240557Z 2025-05-07T20:26:28.1240562Z 2025-05-07T20:26:28.1240568Z 2025-05-07T20:26:28.1240573Z 2025-05-07T20:26:28.1240578Z 2025-05-07T20:26:28.1240583Z 2025-05-07T20:26:28.1917117Z cuda-nvrtc-12.8.61 | 63.1 MB | #4 | 15%  2025-05-07T20:26:28.1917505Z 2025-05-07T20:26:28.1917510Z 2025-05-07T20:26:28.1917515Z 2025-05-07T20:26:28.1917520Z 2025-05-07T20:26:28.1917525Z 2025-05-07T20:26:28.1917531Z 2025-05-07T20:26:28.1917535Z 2025-05-07T20:26:28.1917552Z 2025-05-07T20:26:28.1917558Z 2025-05-07T20:26:28.2244971Z libcurand-10.3.9.55 | 43.6 MB | 7 | 7%  2025-05-07T20:26:28.2245360Z 2025-05-07T20:26:28.2245366Z 2025-05-07T20:26:28.2245371Z 2025-05-07T20:26:28.2245384Z 2025-05-07T20:26:28.2245646Z 2025-05-07T20:26:28.2245652Z 2025-05-07T20:26:28.2245657Z 2025-05-07T20:26:28.2246931Z 2025-05-07T20:26:28.2624024Z cuda-nvrtc-12.8.61 | 63.1 MB | ## | 20%  2025-05-07T20:26:28.2624425Z 2025-05-07T20:26:28.2624430Z 2025-05-07T20:26:28.2624435Z 2025-05-07T20:26:28.2624441Z 2025-05-07T20:26:28.2624446Z 2025-05-07T20:26:28.2624776Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:28.2625134Z 2025-05-07T20:26:28.2625140Z 2025-05-07T20:26:28.2625145Z 2025-05-07T20:26:28.2625151Z 2025-05-07T20:26:28.2625155Z 2025-05-07T20:26:28.2917562Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:28.2917941Z 2025-05-07T20:26:28.2917946Z 2025-05-07T20:26:28.2917952Z 2025-05-07T20:26:28.2917957Z 2025-05-07T20:26:28.2917963Z 2025-05-07T20:26:28.2917968Z 2025-05-07T20:26:28.2917973Z 2025-05-07T20:26:28.2917979Z 2025-05-07T20:26:28.2917985Z 2025-05-07T20:26:28.3250161Z libcurand-10.3.9.55 | 43.6 MB | #4 | 14%  2025-05-07T20:26:28.3250559Z 2025-05-07T20:26:28.3250565Z 2025-05-07T20:26:28.3250833Z 2025-05-07T20:26:28.3250838Z 2025-05-07T20:26:28.3250843Z 2025-05-07T20:26:28.3250848Z 2025-05-07T20:26:28.3250853Z 2025-05-07T20:26:28.3255491Z 2025-05-07T20:26:28.3274410Z cuda-nvrtc-12.8.61 | 63.1 MB | ##5 | 25%  2025-05-07T20:26:28.3274802Z 2025-05-07T20:26:28.3274807Z 2025-05-07T20:26:28.3274812Z 2025-05-07T20:26:28.3274817Z 2025-05-07T20:26:28.3274822Z 2025-05-07T20:26:28.3274827Z 2025-05-07T20:26:28.3274832Z 2025-05-07T20:26:28.3274838Z 2025-05-07T20:26:28.3274843Z 2025-05-07T20:26:28.3274848Z 2025-05-07T20:26:28.4024359Z gds-tools-1.13.0.11 | 37.9 MB | | 0%  2025-05-07T20:26:28.4024756Z 2025-05-07T20:26:28.4024761Z 2025-05-07T20:26:28.4024767Z 2025-05-07T20:26:28.4024772Z 2025-05-07T20:26:28.4024777Z 2025-05-07T20:26:28.4024782Z 2025-05-07T20:26:28.4024805Z 2025-05-07T20:26:28.4024811Z 2025-05-07T20:26:28.4030654Z 2025-05-07T20:26:28.4287377Z libcurand-10.3.9.55 | 43.6 MB | ##1 | 21%  2025-05-07T20:26:28.4287793Z 2025-05-07T20:26:28.4287799Z 2025-05-07T20:26:28.4287804Z 2025-05-07T20:26:28.4287809Z 2025-05-07T20:26:28.4287823Z 2025-05-07T20:26:28.4287829Z 2025-05-07T20:26:28.4287833Z 2025-05-07T20:26:28.4287839Z 2025-05-07T20:26:28.4287843Z 2025-05-07T20:26:28.4287848Z 2025-05-07T20:26:28.4432520Z gds-tools-1.13.0.11 | 37.9 MB | 6 | 7%  2025-05-07T20:26:28.4432922Z 2025-05-07T20:26:28.4432928Z 
2025-05-07T20:26:28.4432933Z 2025-05-07T20:26:28.4432938Z 2025-05-07T20:26:28.4432943Z 2025-05-07T20:26:28.4432948Z 2025-05-07T20:26:28.4432953Z 2025-05-07T20:26:28.4437754Z 2025-05-07T20:26:28.5059215Z cuda-nvrtc-12.8.61 | 63.1 MB | ### | 31%  2025-05-07T20:26:28.5059628Z 2025-05-07T20:26:28.5059634Z 2025-05-07T20:26:28.5059660Z 2025-05-07T20:26:28.5059666Z 2025-05-07T20:26:28.5059671Z 2025-05-07T20:26:28.5059676Z 2025-05-07T20:26:28.5059681Z 2025-05-07T20:26:28.5059696Z 2025-05-07T20:26:28.5059701Z 2025-05-07T20:26:28.5302465Z libcurand-10.3.9.55 | 43.6 MB | ##8 | 28%  2025-05-07T20:26:28.5302868Z 2025-05-07T20:26:28.5302873Z 2025-05-07T20:26:28.5302878Z 2025-05-07T20:26:28.5302883Z 2025-05-07T20:26:28.5302888Z 2025-05-07T20:26:28.5302893Z 2025-05-07T20:26:28.5302898Z 2025-05-07T20:26:28.5302904Z 2025-05-07T20:26:28.5302909Z 2025-05-07T20:26:28.5302914Z 2025-05-07T20:26:28.5528946Z gds-tools-1.13.0.11 | 37.9 MB | #3 | 13%  2025-05-07T20:26:28.5529352Z 2025-05-07T20:26:28.5529357Z 2025-05-07T20:26:28.5529363Z 2025-05-07T20:26:28.5529368Z 2025-05-07T20:26:28.5529373Z 2025-05-07T20:26:28.5529386Z 2025-05-07T20:26:28.5529391Z 2025-05-07T20:26:28.5533628Z 2025-05-07T20:26:28.6095169Z cuda-nvrtc-12.8.61 | 63.1 MB | ###5 | 36%  2025-05-07T20:26:28.6095577Z 2025-05-07T20:26:28.6095582Z 2025-05-07T20:26:28.6095587Z 2025-05-07T20:26:28.6095606Z 2025-05-07T20:26:28.6095611Z 2025-05-07T20:26:28.6095616Z 2025-05-07T20:26:28.6095621Z 2025-05-07T20:26:28.6095626Z 2025-05-07T20:26:28.6099327Z 2025-05-07T20:26:28.6303406Z libcurand-10.3.9.55 | 43.6 MB | ###4 | 35%  2025-05-07T20:26:28.6303800Z 2025-05-07T20:26:28.6303805Z 2025-05-07T20:26:28.6303810Z 2025-05-07T20:26:28.6303815Z 2025-05-07T20:26:28.6303820Z 2025-05-07T20:26:28.6303825Z 2025-05-07T20:26:28.6303830Z 2025-05-07T20:26:28.6303835Z 2025-05-07T20:26:28.6303840Z 2025-05-07T20:26:28.6307370Z 2025-05-07T20:26:28.6644395Z gds-tools-1.13.0.11 | 37.9 MB | ## | 20%  2025-05-07T20:26:28.6644789Z 2025-05-07T20:26:28.6644794Z 2025-05-07T20:26:28.6644800Z 2025-05-07T20:26:28.6644805Z 2025-05-07T20:26:28.6644810Z 2025-05-07T20:26:28.6644815Z 2025-05-07T20:26:28.6644834Z 2025-05-07T20:26:28.6648273Z 2025-05-07T20:26:28.7096891Z cuda-nvrtc-12.8.61 | 63.1 MB | #### | 40%  2025-05-07T20:26:28.7097548Z 2025-05-07T20:26:28.7097553Z 2025-05-07T20:26:28.7097559Z 2025-05-07T20:26:28.7097564Z 2025-05-07T20:26:28.7097569Z 2025-05-07T20:26:28.7097573Z 2025-05-07T20:26:28.7097588Z 2025-05-07T20:26:28.7097594Z 2025-05-07T20:26:28.7097599Z 2025-05-07T20:26:28.7304856Z libcurand-10.3.9.55 | 43.6 MB | ####1 | 42%  2025-05-07T20:26:28.7305253Z 2025-05-07T20:26:28.7305259Z 2025-05-07T20:26:28.7305274Z 2025-05-07T20:26:28.7305279Z 2025-05-07T20:26:28.7305284Z 2025-05-07T20:26:28.7305289Z 2025-05-07T20:26:28.7305294Z 2025-05-07T20:26:28.7305299Z 2025-05-07T20:26:28.7305304Z 2025-05-07T20:26:28.7305416Z 2025-05-07T20:26:28.7759429Z gds-tools-1.13.0.11 | 37.9 MB | ##7 | 27%  2025-05-07T20:26:28.7759833Z 2025-05-07T20:26:28.7759839Z 2025-05-07T20:26:28.7759873Z 2025-05-07T20:26:28.7759878Z 2025-05-07T20:26:28.7759883Z 2025-05-07T20:26:28.7759888Z 2025-05-07T20:26:28.7759893Z 2025-05-07T20:26:28.7762744Z 2025-05-07T20:26:28.8115882Z cuda-nvrtc-12.8.61 | 63.1 MB | ####4 | 45%  2025-05-07T20:26:28.8116277Z 2025-05-07T20:26:28.8116282Z 2025-05-07T20:26:28.8116286Z 2025-05-07T20:26:28.8116289Z 2025-05-07T20:26:28.8116293Z 2025-05-07T20:26:28.8116297Z 2025-05-07T20:26:28.8116300Z 2025-05-07T20:26:28.8116304Z 2025-05-07T20:26:28.8116308Z 2025-05-07T20:26:28.8305724Z libcurand-10.3.9.55 
| 43.6 MB | ####8 | 48%  2025-05-07T20:26:28.8306135Z 2025-05-07T20:26:28.8306141Z 2025-05-07T20:26:28.8306146Z 2025-05-07T20:26:28.8306151Z 2025-05-07T20:26:28.8306156Z 2025-05-07T20:26:28.8306162Z 2025-05-07T20:26:28.8306167Z 2025-05-07T20:26:28.8306172Z 2025-05-07T20:26:28.8306177Z 2025-05-07T20:26:28.8306182Z 2025-05-07T20:26:28.8829681Z gds-tools-1.13.0.11 | 37.9 MB | ###4 | 35%  2025-05-07T20:26:28.8830072Z 2025-05-07T20:26:28.8830078Z 2025-05-07T20:26:28.8830083Z 2025-05-07T20:26:28.8830101Z 2025-05-07T20:26:28.8830106Z 2025-05-07T20:26:28.8830111Z 2025-05-07T20:26:28.8830116Z 2025-05-07T20:26:28.8830898Z 2025-05-07T20:26:28.9222451Z cuda-nvrtc-12.8.61 | 63.1 MB | ####9 | 49%  2025-05-07T20:26:28.9222839Z 2025-05-07T20:26:28.9222845Z 2025-05-07T20:26:28.9222850Z 2025-05-07T20:26:28.9222855Z 2025-05-07T20:26:28.9222860Z 2025-05-07T20:26:28.9222865Z 2025-05-07T20:26:28.9222883Z 2025-05-07T20:26:28.9222888Z 2025-05-07T20:26:28.9224517Z 2025-05-07T20:26:28.9353330Z libcurand-10.3.9.55 | 43.6 MB | #####4 | 55%  2025-05-07T20:26:28.9353720Z 2025-05-07T20:26:28.9353734Z 2025-05-07T20:26:28.9353739Z 2025-05-07T20:26:28.9353745Z 2025-05-07T20:26:28.9353750Z 2025-05-07T20:26:28.9353755Z 2025-05-07T20:26:28.9353760Z 2025-05-07T20:26:28.9353765Z 2025-05-07T20:26:28.9354028Z 2025-05-07T20:26:28.9354035Z 2025-05-07T20:26:28.9859234Z gds-tools-1.13.0.11 | 37.9 MB | ####1 | 42%  2025-05-07T20:26:28.9859644Z 2025-05-07T20:26:28.9859649Z 2025-05-07T20:26:28.9859654Z 2025-05-07T20:26:28.9859660Z 2025-05-07T20:26:28.9859665Z 2025-05-07T20:26:28.9859670Z 2025-05-07T20:26:28.9859675Z 2025-05-07T20:26:28.9861139Z 2025-05-07T20:26:29.0261196Z cuda-nvrtc-12.8.61 | 63.1 MB | #####3 | 54%  2025-05-07T20:26:29.0261593Z 2025-05-07T20:26:29.0261598Z 2025-05-07T20:26:29.0261604Z 2025-05-07T20:26:29.0261609Z 2025-05-07T20:26:29.0261614Z 2025-05-07T20:26:29.0261620Z 2025-05-07T20:26:29.0261625Z 2025-05-07T20:26:29.0261630Z 2025-05-07T20:26:29.0261635Z 2025-05-07T20:26:29.0355001Z libcurand-10.3.9.55 | 43.6 MB | ######1 | 61%  2025-05-07T20:26:29.0355418Z 2025-05-07T20:26:29.0355423Z 2025-05-07T20:26:29.0355429Z 2025-05-07T20:26:29.0355434Z 2025-05-07T20:26:29.0355450Z 2025-05-07T20:26:29.0355455Z 2025-05-07T20:26:29.0355460Z 2025-05-07T20:26:29.0355465Z 2025-05-07T20:26:29.0355559Z 2025-05-07T20:26:29.0357523Z 2025-05-07T20:26:29.0875366Z gds-tools-1.13.0.11 | 37.9 MB | ####9 | 49%  2025-05-07T20:26:29.0875880Z 2025-05-07T20:26:29.0875886Z 2025-05-07T20:26:29.0875891Z 2025-05-07T20:26:29.0875896Z 2025-05-07T20:26:29.0875901Z 2025-05-07T20:26:29.0875906Z 2025-05-07T20:26:29.0875911Z 2025-05-07T20:26:29.0875916Z 2025-05-07T20:26:29.1269447Z cuda-nvrtc-12.8.61 | 63.1 MB | #####8 | 58%  2025-05-07T20:26:29.1269838Z 2025-05-07T20:26:29.1269844Z 2025-05-07T20:26:29.1269849Z 2025-05-07T20:26:29.1269854Z 2025-05-07T20:26:29.1269869Z 2025-05-07T20:26:29.1269874Z 2025-05-07T20:26:29.1269879Z 2025-05-07T20:26:29.1269884Z 2025-05-07T20:26:29.1269890Z 2025-05-07T20:26:29.1449537Z libcurand-10.3.9.55 | 43.6 MB | ######7 | 68%  2025-05-07T20:26:29.1449964Z 2025-05-07T20:26:29.1449970Z 2025-05-07T20:26:29.1449975Z 2025-05-07T20:26:29.1449980Z 2025-05-07T20:26:29.1449985Z 2025-05-07T20:26:29.1450003Z 2025-05-07T20:26:29.1450008Z 2025-05-07T20:26:29.1450013Z 2025-05-07T20:26:29.1450018Z 2025-05-07T20:26:29.1451595Z 2025-05-07T20:26:29.1880230Z gds-tools-1.13.0.11 | 37.9 MB | #####6 | 57%  2025-05-07T20:26:29.1880573Z 2025-05-07T20:26:29.1880577Z 2025-05-07T20:26:29.1880581Z 2025-05-07T20:26:29.1880585Z 2025-05-07T20:26:29.1880588Z 
2025-05-07T20:26:29.1880599Z 2025-05-07T20:26:29.1880603Z 2025-05-07T20:26:29.1880606Z 2025-05-07T20:26:29.2307799Z cuda-nvrtc-12.8.61 | 63.1 MB | ######2 | 63%  2025-05-07T20:26:29.2308102Z 2025-05-07T20:26:29.2308114Z 2025-05-07T20:26:29.2308118Z 2025-05-07T20:26:29.2308121Z 2025-05-07T20:26:29.2308125Z 2025-05-07T20:26:29.2308129Z 2025-05-07T20:26:29.2308132Z 2025-05-07T20:26:29.2308136Z 2025-05-07T20:26:29.2308140Z 2025-05-07T20:26:29.2466793Z libcurand-10.3.9.55 | 43.6 MB | #######4 | 74%  2025-05-07T20:26:29.2467171Z 2025-05-07T20:26:29.2467195Z 2025-05-07T20:26:29.2467200Z 2025-05-07T20:26:29.2467205Z 2025-05-07T20:26:29.2467210Z 2025-05-07T20:26:29.2467215Z 2025-05-07T20:26:29.2467221Z 2025-05-07T20:26:29.2467225Z 2025-05-07T20:26:29.2467230Z 2025-05-07T20:26:29.2467236Z 2025-05-07T20:26:29.2923921Z gds-tools-1.13.0.11 | 37.9 MB | ######3 | 64%  2025-05-07T20:26:29.2924228Z 2025-05-07T20:26:29.2924232Z 2025-05-07T20:26:29.2924235Z 2025-05-07T20:26:29.2924239Z 2025-05-07T20:26:29.2924243Z 2025-05-07T20:26:29.2924247Z 2025-05-07T20:26:29.2924250Z 2025-05-07T20:26:29.2929995Z 2025-05-07T20:26:29.3308566Z cuda-nvrtc-12.8.61 | 63.1 MB | ######7 | 67%  2025-05-07T20:26:29.3308951Z 2025-05-07T20:26:29.3308957Z 2025-05-07T20:26:29.3308962Z 2025-05-07T20:26:29.3308966Z 2025-05-07T20:26:29.3308971Z 2025-05-07T20:26:29.3309242Z 2025-05-07T20:26:29.3309249Z 2025-05-07T20:26:29.3309254Z 2025-05-07T20:26:29.3309259Z 2025-05-07T20:26:29.3469749Z libcurand-10.3.9.55 | 43.6 MB | ########1 | 81%  2025-05-07T20:26:29.3470167Z 2025-05-07T20:26:29.3470172Z 2025-05-07T20:26:29.3470177Z 2025-05-07T20:26:29.3470181Z 2025-05-07T20:26:29.3470186Z 2025-05-07T20:26:29.3470198Z 2025-05-07T20:26:29.3470203Z 2025-05-07T20:26:29.3470207Z 2025-05-07T20:26:29.3470212Z 2025-05-07T20:26:29.3471344Z 2025-05-07T20:26:29.4038521Z gds-tools-1.13.0.11 | 37.9 MB | #######1 | 71%  2025-05-07T20:26:29.4038865Z 2025-05-07T20:26:29.4038869Z 2025-05-07T20:26:29.4038873Z 2025-05-07T20:26:29.4038877Z 2025-05-07T20:26:29.4038880Z 2025-05-07T20:26:29.4038884Z 2025-05-07T20:26:29.4038888Z 2025-05-07T20:26:29.4040202Z 2025-05-07T20:26:29.4356908Z cuda-nvrtc-12.8.61 | 63.1 MB | #######1 | 72%  2025-05-07T20:26:29.4357247Z 2025-05-07T20:26:29.4357281Z 2025-05-07T20:26:29.4357285Z 2025-05-07T20:26:29.4357289Z 2025-05-07T20:26:29.4357293Z 2025-05-07T20:26:29.4357296Z 2025-05-07T20:26:29.4357534Z 2025-05-07T20:26:29.4357537Z 2025-05-07T20:26:29.4360022Z 2025-05-07T20:26:29.4498482Z libcurand-10.3.9.55 | 43.6 MB | ########7 | 88%  2025-05-07T20:26:29.4498858Z 2025-05-07T20:26:29.4498862Z 2025-05-07T20:26:29.4498866Z 2025-05-07T20:26:29.4498869Z 2025-05-07T20:26:29.4498873Z 2025-05-07T20:26:29.4498877Z 2025-05-07T20:26:29.4498880Z 2025-05-07T20:26:29.4498884Z 2025-05-07T20:26:29.4498887Z 2025-05-07T20:26:29.4498891Z 2025-05-07T20:26:29.5038793Z gds-tools-1.13.0.11 | 37.9 MB | #######8 | 78%  2025-05-07T20:26:29.5039109Z 2025-05-07T20:26:29.5039113Z 2025-05-07T20:26:29.5039116Z 2025-05-07T20:26:29.5039120Z 2025-05-07T20:26:29.5039124Z 2025-05-07T20:26:29.5039127Z 2025-05-07T20:26:29.5039131Z 2025-05-07T20:26:29.5040588Z 2025-05-07T20:26:29.5452861Z cuda-nvrtc-12.8.61 | 63.1 MB | #######6 | 76%  2025-05-07T20:26:29.5453192Z 2025-05-07T20:26:29.5453196Z 2025-05-07T20:26:29.5453211Z 2025-05-07T20:26:29.5453215Z 2025-05-07T20:26:29.5453218Z 2025-05-07T20:26:29.5453222Z 2025-05-07T20:26:29.5453225Z 2025-05-07T20:26:29.5453229Z 2025-05-07T20:26:29.5458167Z 2025-05-07T20:26:29.5512953Z libcurand-10.3.9.55 | 43.6 MB | #########3 | 94%  
2025-05-07T20:26:29.5513332Z 2025-05-07T20:26:29.5513338Z 2025-05-07T20:26:29.5513353Z 2025-05-07T20:26:29.5513359Z 2025-05-07T20:26:29.5513363Z 2025-05-07T20:26:29.5513366Z 2025-05-07T20:26:29.5513370Z 2025-05-07T20:26:29.5513373Z 2025-05-07T20:26:29.5513377Z 2025-05-07T20:26:29.5513380Z 2025-05-07T20:26:29.6040830Z gds-tools-1.13.0.11 | 37.9 MB | ########5 | 85%  2025-05-07T20:26:29.6041197Z 2025-05-07T20:26:29.6041201Z 2025-05-07T20:26:29.6041205Z 2025-05-07T20:26:29.6041208Z 2025-05-07T20:26:29.6041212Z 2025-05-07T20:26:29.6041228Z 2025-05-07T20:26:29.6041232Z 2025-05-07T20:26:29.6043476Z 2025-05-07T20:26:29.7043016Z cuda-nvrtc-12.8.61 | 63.1 MB | ######## | 81%  2025-05-07T20:26:29.7043332Z 2025-05-07T20:26:29.7043335Z 2025-05-07T20:26:29.7043339Z 2025-05-07T20:26:29.7043342Z 2025-05-07T20:26:29.7043346Z 2025-05-07T20:26:29.7043349Z 2025-05-07T20:26:29.7043353Z 2025-05-07T20:26:29.7056376Z 2025-05-07T20:26:29.7797480Z cuda-nvrtc-12.8.61 | 63.1 MB | ########6 | 86%  2025-05-07T20:26:29.7797771Z 2025-05-07T20:26:29.7797775Z 2025-05-07T20:26:29.7797779Z 2025-05-07T20:26:29.7797782Z 2025-05-07T20:26:29.7797786Z 2025-05-07T20:26:29.7797789Z 2025-05-07T20:26:29.7797793Z 2025-05-07T20:26:29.7797797Z 2025-05-07T20:26:29.7797800Z 2025-05-07T20:26:29.7797804Z 2025-05-07T20:26:29.8045313Z gds-tools-1.13.0.11 | 37.9 MB | #########2 | 92%  2025-05-07T20:26:29.8045671Z 2025-05-07T20:26:29.8045675Z 2025-05-07T20:26:29.8045888Z 2025-05-07T20:26:29.8045894Z 2025-05-07T20:26:29.8045897Z 2025-05-07T20:26:29.8045910Z 2025-05-07T20:26:29.8045913Z 2025-05-07T20:26:29.8047392Z 2025-05-07T20:26:29.8800278Z cuda-nvrtc-12.8.61 | 63.1 MB | #########1 | 91%  2025-05-07T20:26:29.8800683Z 2025-05-07T20:26:29.8800690Z 2025-05-07T20:26:29.8800694Z 2025-05-07T20:26:29.8800700Z 2025-05-07T20:26:29.8800705Z 2025-05-07T20:26:29.8800710Z 2025-05-07T20:26:29.8800715Z 2025-05-07T20:26:29.8800720Z 2025-05-07T20:26:29.8800725Z 2025-05-07T20:26:29.8800730Z 2025-05-07T20:26:29.9062912Z gds-tools-1.13.0.11 | 37.9 MB | #########9 | 100%  2025-05-07T20:26:29.9063359Z 2025-05-07T20:26:29.9063365Z 2025-05-07T20:26:29.9063370Z 2025-05-07T20:26:29.9063375Z 2025-05-07T20:26:29.9063380Z 2025-05-07T20:26:29.9063385Z 2025-05-07T20:26:29.9063390Z 2025-05-07T20:26:29.9063395Z 2025-05-07T20:26:31.1497596Z cuda-nvrtc-12.8.61 | 63.1 MB | #########6 | 96%  2025-05-07T20:26:31.1498042Z 2025-05-07T20:26:31.1498049Z 2025-05-07T20:26:31.1498055Z 2025-05-07T20:26:31.1498068Z 2025-05-07T20:26:31.1498338Z 2025-05-07T20:26:31.1498341Z 2025-05-07T20:26:31.1498345Z 2025-05-07T20:26:31.1498349Z 2025-05-07T20:26:31.1498353Z 2025-05-07T20:26:31.1498356Z 2025-05-07T20:26:31.1907427Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:26:31.1907839Z 2025-05-07T20:26:31.1907844Z 2025-05-07T20:26:31.1907849Z 2025-05-07T20:26:31.1907854Z 2025-05-07T20:26:31.1907859Z 2025-05-07T20:26:31.1907864Z 2025-05-07T20:26:31.1907870Z 2025-05-07T20:26:31.1907875Z 2025-05-07T20:26:31.1909149Z 2025-05-07T20:26:31.1912982Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:26:31.1913373Z 2025-05-07T20:26:31.1913380Z 2025-05-07T20:26:31.1913385Z 2025-05-07T20:26:31.1913390Z 2025-05-07T20:26:31.1913395Z 2025-05-07T20:26:31.1913400Z 2025-05-07T20:26:31.1913406Z 2025-05-07T20:26:31.1913432Z 2025-05-07T20:26:31.1913438Z 2025-05-07T20:26:31.1913443Z 2025-05-07T20:26:31.1922474Z 2025-05-07T20:26:31.2472540Z libnvjitlink-12.8.61 | 28.7 MB | | 0%  2025-05-07T20:26:31.2472973Z 2025-05-07T20:26:31.2472978Z 2025-05-07T20:26:31.2472984Z 
2025-05-07T20:26:31.2472989Z 2025-05-07T20:26:31.2472994Z 2025-05-07T20:26:31.2472999Z 2025-05-07T20:26:31.2473005Z 2025-05-07T20:26:31.2473010Z 2025-05-07T20:26:31.2473026Z 2025-05-07T20:26:31.2473032Z 2025-05-07T20:26:31.2473038Z 2025-05-07T20:26:31.2473047Z 2025-05-07T20:26:31.2908893Z cuda-nvcc-tools-12.8 | 24.5 MB | | 0%  2025-05-07T20:26:31.2909226Z 2025-05-07T20:26:31.2909230Z 2025-05-07T20:26:31.2909234Z 2025-05-07T20:26:31.2909238Z 2025-05-07T20:26:31.2909241Z 2025-05-07T20:26:31.2909245Z 2025-05-07T20:26:31.2909249Z 2025-05-07T20:26:31.2909252Z 2025-05-07T20:26:31.2909256Z 2025-05-07T20:26:31.2909260Z 2025-05-07T20:26:31.2911504Z 2025-05-07T20:26:31.3478863Z libnvjitlink-12.8.61 | 28.7 MB | # | 10%  2025-05-07T20:26:31.3479199Z 2025-05-07T20:26:31.3479203Z 2025-05-07T20:26:31.3479207Z 2025-05-07T20:26:31.3479211Z 2025-05-07T20:26:31.3479214Z 2025-05-07T20:26:31.3479218Z 2025-05-07T20:26:31.3479222Z 2025-05-07T20:26:31.3479225Z 2025-05-07T20:26:31.3479229Z 2025-05-07T20:26:31.3479233Z 2025-05-07T20:26:31.3479236Z 2025-05-07T20:26:31.3479240Z 2025-05-07T20:26:31.3917043Z cuda-nvcc-tools-12.8 | 24.5 MB | # | 11%  2025-05-07T20:26:31.3917358Z 2025-05-07T20:26:31.3917363Z 2025-05-07T20:26:31.3917366Z 2025-05-07T20:26:31.3917370Z 2025-05-07T20:26:31.3917374Z 2025-05-07T20:26:31.3917378Z 2025-05-07T20:26:31.3917381Z 2025-05-07T20:26:31.3917392Z 2025-05-07T20:26:31.3917396Z 2025-05-07T20:26:31.3917399Z 2025-05-07T20:26:31.3917431Z 2025-05-07T20:26:31.4137712Z libnvjitlink-12.8.61 | 28.7 MB | ## | 21%  2025-05-07T20:26:31.4138143Z 2025-05-07T20:26:31.4138150Z 2025-05-07T20:26:31.4138895Z 2025-05-07T20:26:31.4490000Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:31.4490336Z 2025-05-07T20:26:31.4490341Z 2025-05-07T20:26:31.4490345Z 2025-05-07T20:26:31.4490348Z 2025-05-07T20:26:31.4490352Z 2025-05-07T20:26:31.4490356Z 2025-05-07T20:26:31.4490359Z 2025-05-07T20:26:31.4490363Z 2025-05-07T20:26:31.4490366Z 2025-05-07T20:26:31.4490370Z 2025-05-07T20:26:31.4490374Z 2025-05-07T20:26:31.4493133Z 2025-05-07T20:26:31.4921590Z cuda-nvcc-tools-12.8 | 24.5 MB | ##1 | 21%  2025-05-07T20:26:31.4922093Z 2025-05-07T20:26:31.4922097Z 2025-05-07T20:26:31.4922101Z 2025-05-07T20:26:31.4922105Z 2025-05-07T20:26:31.4922108Z 2025-05-07T20:26:31.4922112Z 2025-05-07T20:26:31.4922115Z 2025-05-07T20:26:31.4922119Z 2025-05-07T20:26:31.4922123Z 2025-05-07T20:26:31.4922126Z 2025-05-07T20:26:31.4922130Z 2025-05-07T20:26:31.5559155Z libnvjitlink-12.8.61 | 28.7 MB | ###1 | 32%  2025-05-07T20:26:31.5559553Z 2025-05-07T20:26:31.5559805Z 2025-05-07T20:26:31.5559809Z 2025-05-07T20:26:31.5559813Z 2025-05-07T20:26:31.5559816Z 2025-05-07T20:26:31.5559820Z 2025-05-07T20:26:31.5559823Z 2025-05-07T20:26:31.5559827Z 2025-05-07T20:26:31.5559831Z 2025-05-07T20:26:31.5559844Z 2025-05-07T20:26:31.5559847Z 2025-05-07T20:26:31.5563009Z 2025-05-07T20:26:31.5924267Z cuda-nvcc-tools-12.8 | 24.5 MB | ###2 | 32%  2025-05-07T20:26:31.5924619Z 2025-05-07T20:26:31.5924623Z 2025-05-07T20:26:31.5924627Z 2025-05-07T20:26:31.5924630Z 2025-05-07T20:26:31.5924634Z 2025-05-07T20:26:31.5924638Z 2025-05-07T20:26:31.5924641Z 2025-05-07T20:26:31.5924645Z 2025-05-07T20:26:31.5924649Z 2025-05-07T20:26:31.5924652Z 2025-05-07T20:26:31.5927694Z 2025-05-07T20:26:31.6568037Z libnvjitlink-12.8.61 | 28.7 MB | ####2 | 42%  2025-05-07T20:26:31.6568470Z 2025-05-07T20:26:31.6568474Z 2025-05-07T20:26:31.6568478Z 2025-05-07T20:26:31.6568481Z 2025-05-07T20:26:31.6568499Z 2025-05-07T20:26:31.6568503Z 2025-05-07T20:26:31.6568506Z 
2025-05-07T20:26:31.8161325Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:26:32.0243391Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:26:32.2131724Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:26:33.1287830Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:26:33.3789384Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:26:33.6935735Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:26:33.7936329Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:26:33.9624654Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:26:33.9969104Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:26:34.0010588Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:26:34.1012604Z cuda-cupti-dev-12.8. | 4.0 MB | ########## | 100%
2025-05-07T20:26:34.2596800Z ... (more hidden) ...
2025-05-07T20:26:35.0954590Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:26:35.6477980Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:26:36.2290263Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:26:36.2614606Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:36.6271055Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:26:36.9864936Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:26:44.0315843Z Preparing transaction: done
2025-05-07T20:26:48.0465607Z Verifying transaction: done
2025-05-07T20:26:48.6547313Z Executing transaction: done
2025-05-07T20:26:50.8241903Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:50.8242340Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:50.8243034Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:50.8257176Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
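[NOTE] The two `ln -sf` calls above recreate the unversioned libnvToolsExt.so name that the build links against but that, judging from this log, the CUDA 12.8.0+ conda packages no longer ship (only the versioned .so.1 is present). A minimal sketch of the same fix; the PREFIX variable and the loop are illustrative, not part of the actual setup scripts:

  PREFIX=/home/ec2-user/miniconda/envs/build_binary
  # Recreate the unversioned .so in both library directories, if the versioned one exists
  for libdir in "$PREFIX/lib" "$PREFIX/targets/x86_64-linux/lib"; do
    if [ -f "$libdir/libnvToolsExt.so.1" ]; then
      ln -sf "$libdir/libnvToolsExt.so.1" "$libdir/libnvToolsExt.so"
    fi
  done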
2025-05-07T20:26:50.8270897Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:50.8276127Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:50.9922639Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:50.9946105Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:51.0325466Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:52.9157268Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:52.9794818Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:53.4036027Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:53.4379106Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
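[NOTE] The ERROR from `conda run printenv LD_LIBRARY_PATH` above is expected at this point: printenv exits non-zero when the variable is not yet set in the env, and the very next step sets it. `conda env config vars set` stores the variable in the environment itself, so it is exported on every later activation or `conda run`. A minimal sketch of that pattern, reusing the env name and value from this log; the list/printenv calls are just two ways to verify:

  # Persist the stubs directory in the named env, then confirm it took effect
  conda env config vars set -n build_binary \
      LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
  conda env config vars list -n build_binary
  conda run -n build_binary printenv LD_LIBRARY_PATH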
2025-05-07T20:26:53.8761263Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/" 2025-05-07T20:26:53.8761988Z 2025-05-07T20:26:54.2998481Z 2025-05-07T20:26:56.3327275Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h 2025-05-07T20:26:58.3806424Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so 2025-05-07T20:27:00.4067954Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:27:00.4068750Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:27:02.4396553Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T20:27:04.3414899Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc 2025-05-07T20:27:04.3415281Z 2025-05-07T20:27:04.4029773Z [CHECK] Binary nvcc found in PATH 2025-05-07T20:27:08.2437528Z /tmp/tmpc19u08lt: line 3: clang: command not found 2025-05-07T20:27:08.2437819Z 2025-05-07T20:27:08.2438218Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error) 2025-05-07T20:27:08.3069615Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d 2025-05-07T20:27:08.3069930Z 2025-05-07T20:27:08.3090262Z total 36 2025-05-07T20:27:08.3090563Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 . 2025-05-07T20:27:08.3090942Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 .. 2025-05-07T20:27:08.3091394Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh 2025-05-07T20:27:08.3091908Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh 2025-05-07T20:27:08.3092389Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh 2025-05-07T20:27:08.3092861Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh 2025-05-07T20:27:08.3093607Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh 2025-05-07T20:27:08.3094074Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh 2025-05-07T20:27:08.3094365Z 2025-05-07T20:27:08.3094585Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ... 2025-05-07T20:27:08.3095229Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:27:08.3095645Z 2025-05-07T20:27:08.3117142Z 2025-05-07T20:27:08.3117549Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:27:08.3117812Z 2025-05-07T20:27:10.2823624Z 2025-05-07T20:27:10.2824635Z [BUILD] Setting prepend flags for NVCC ... 2025-05-07T20:27:10.2825685Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:27:10.2826435Z 2025-05-07T20:27:10.7097656Z 2025-05-07T20:27:10.7098355Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:27:10.7098837Z 2025-05-07T20:27:12.6032668Z -allow-unsupported-compiler 2025-05-07T20:27:12.6032974Z 2025-05-07T20:27:12.6660547Z 2025-05-07T20:27:12.6660829Z [INFO] Printing out all preprocessor defines in nvcc ... 
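[NOTE] The last two steps work around nvcc's host-compiler checks: the sed call strips the `-ccbin=` pin that the cuda-nvcc activation script would otherwise inject, and NVCC_PREPEND_FLAGS=-allow-unsupported-compiler stops nvcc from rejecting a host gcc newer than it officially supports. A minimal sketch of the same workaround in an activated shell; the export is illustrative, since this job persists the flag with `conda env config vars set` instead:

  # Drop the pinned host compiler from the nvcc activation hook ...
  sed -i '/-ccbin=/d' "$CONDA_PREFIX"/etc/conda/activate.d/*cuda-nvcc_activate.sh
  # ... and let nvcc accept a host compiler it does not officially support
  export NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"

The preprocessor dump that follows can also be filtered rather than printed whole, e.g. by piping the same `nvcc --compiler-options -dM -E -x cu -` invocation through `grep __CUDACC` to check just the toolkit version macros.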
2025-05-07T20:27:12.6660829Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:27:12.6661655Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:27:14.6326948Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:27:14.6327566Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:27:14.6327907Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:27:14.6328231Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:27:14.6328561Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:27:14.6328824Z #define _STL_PAIR_H 1
2025-05-07T20:27:14.6329078Z #define __cpp_attributes 200809L
2025-05-07T20:27:14.6329402Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:27:14.6329742Z #define __DELETE_THROW throw()
2025-05-07T20:27:14.6330005Z #define _PTRDIFF_T_
2025-05-07T20:27:14.6330271Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:27:14.6330556Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:27:14.6330829Z #define _IO_LEFT 02
2025-05-07T20:27:14.6331077Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:27:14.6331330Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:27:14.6331607Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:27:14.6332065Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:27:14.6332497Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:27:14.6332840Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:27:14.6333225Z #define _IOS_OUTPUT 2
2025-05-07T20:27:14.6333566Z #define __SM_100_RT_HPP__
2025-05-07T20:27:14.6334026Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:27:14.6334529Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:27:14.6334955Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:27:14.6335317Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:27:14.6335701Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:27:14.6336777Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:27:14.6337864Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:27:14.6338277Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:27:14.6338677Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:27:14.6339089Z #define _T_WCHAR_
2025-05-07T20:27:14.6339313Z #define stdout stdout
2025-05-07T20:27:14.6339645Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:27:14.6340025Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:27:14.6340276Z #define __flexarr []
2025-05-07T20:27:14.6340515Z #define _GLIBCXX_HAVE_FINITEF 1
2025-05-07T20:27:14.6340837Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l))
2025-05-07T20:27:14.6341177Z #define _IO_FLAGS2_USER_WBUF 8
2025-05-07T20:27:14.6341434Z #define _MATH_H 1
2025-05-07T20:27:14.6342010Z #define cudaOccupancyDisableCachingOverride 0x01
2025-05-07T20:27:14.6342356Z #define __S64_TYPE long int
2025-05-07T20:27:14.6342609Z #define __stub_fchflags
2025-05-07T20:27:14.6342879Z #define cudaDeviceScheduleMask 0x07
2025-05-07T20:27:14.6343174Z #define __SQUAD_TYPE long int
2025-05-07T20:27:14.6343435Z #define __INTMAX_C(c) c ## L
2025-05-07T20:27:14.6343741Z #define cudaStreamFireAndForget ((cudaStream_t)0x4)
2025-05-07T20:27:14.6344132Z #define _BSD_SIZE_T_DEFINED_
2025-05-07T20:27:14.6344386Z #define NL_NMAX INT_MAX
2025-05-07T20:27:14.6344622Z #define _BITS_TIME_H 1
2025-05-07T20:27:14.6344899Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:14.6345223Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:14.6345528Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:14.6345880Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:14.6346279Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:14.6346645Z #define __CHAR_BIT__ 8 2025-05-07T20:27:14.6346909Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6347383Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:14.6347673Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:14.6347942Z #define FP_NAN 0 2025-05-07T20:27:14.6348205Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:14.6348611Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:14.6348999Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:14.6349287Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:14.6349547Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:14.6349809Z #define __SM_80_RT_H__ 2025-05-07T20:27:14.6350040Z #define _NEW 2025-05-07T20:27:14.6350264Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:14.6350545Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:14.6350914Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:14.6351323Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:14.6351570Z #define __USE_ANSI 1 2025-05-07T20:27:14.6351858Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:14.6352258Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:14.6352612Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:14.6352915Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:14.6353200Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:14.6353479Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:14.6353761Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:14.6354049Z #define PIPE_BUF 4096 2025-05-07T20:27:14.6354363Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:14.6354816Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:27:14.6355195Z #define ADJ_TICK 0x4000 2025-05-07T20:27:14.6355603Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:14.6355923Z #define MQ_PRIO_MAX 32768 2025-05-07T20:27:14.6356199Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:14.6356525Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:14.6356996Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:14.6357523Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:14.6357896Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:14.6358156Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:14.6358430Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6358720Z #define __cpp_static_assert 201411L 2025-05-07T20:27:14.6359008Z #define __GLIBCXX__ 20230528 2025-05-07T20:27:14.6359275Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:14.6359565Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:14.6359846Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:14.6360147Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:14.6360431Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:14.6360738Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6361181Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:14.6361525Z #define 
__WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:14.6361814Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:14.6362136Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6362491Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:14.6362860Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:14.6363158Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:14.6363449Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:14.6363801Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:14.6364164Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:14.6364570Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:14.6364979Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:14.6365295Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:14.6366556Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:14.6366870Z #define __GCC_IEC_559 2 2025-05-07T20:27:14.6367190Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:14.6367571Z #define _IO_flockfile(_fp) 2025-05-07T20:27:14.6367979Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:14.6368249Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:14.6368514Z #define _IOFBF 0 2025-05-07T20:27:14.6368726Z #define __USE_BSD 1 2025-05-07T20:27:14.6368955Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:14.6369230Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:14.6369498Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:14.6369755Z #define _IO_NO_WRITES 8 2025-05-07T20:27:14.6370017Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:14.6370377Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:14.6370726Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:14.6371036Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:14.6371363Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:14.6371659Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:14.6371932Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:14.6372206Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:14.6372521Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:14.6372907Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:14.6373277Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:14.6373581Z #define M_PI 3.14159265358979323846 2025-05-07T20:27:14.6373892Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:14.6374229Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:14.6374539Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:14.6374840Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:14.6375118Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:14.6375388Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:14.6375974Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:14.6376565Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:14.6376889Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:14.6377222Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:14.6377526Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:14.6377802Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:14.6378065Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:14.6378365Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:27:14.6378696Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:14.6378995Z #define RAND_MAX 2147483647 2025-05-07T20:27:14.6379257Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:14.6379584Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6379902Z #define __SM_90_RT_H__ 2025-05-07T20:27:14.6380143Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:14.6380405Z #define __COMPAR_FN_T 2025-05-07T20:27:14.6380654Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6381040Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:14.6381520Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:14.6382038Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:14.6382376Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:14.6382738Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:14.6383040Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:14.6383378Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:14.6383685Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:14.6384191Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:14.6384733Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:14.6385058Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:14.6385339Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:14.6385642Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:14.6385939Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:14.6386216Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:14.6386488Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:14.6386752Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:14.6387082Z #define __u_char_defined 2025-05-07T20:27:14.6387402Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:14.6387764Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:14.6388019Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:14.6388275Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:14.6388559Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:14.6388990Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:14.6389415Z #define FP_INFINITE 1 2025-05-07T20:27:14.6389782Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:14.6390193Z #define _IO_pid_t __pid_t 2025-05-07T20:27:14.6390451Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:27:14.6390720Z #define __LEAF , __leaf__ 2025-05-07T20:27:14.6390965Z #define PATH_MAX 4096 2025-05-07T20:27:14.6391224Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:27:14.6391564Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:27:14.6391895Z #define _LIMITS_H___ 2025-05-07T20:27:14.6392116Z #define __size_t 2025-05-07T20:27:14.6392350Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:27:14.6392889Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:27:14.6393449Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:27:14.6393760Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:27:14.6394095Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:27:14.6394356Z #define _WCHAR_T_DEFINED 2025-05-07T20:27:14.6394711Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:27:14.6395111Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:27:14.6395476Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:27:14.6395805Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:27:14.6396090Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:27:14.6396373Z #define __INT8_C(c) c 2025-05-07T20:27:14.6396638Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:27:14.6396940Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:27:14.6397204Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:27:14.6397461Z #define __SM_70_RT_HPP__ 2025-05-07T20:27:14.6397715Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:27:14.6397988Z #define __cpp_variadic_using 201611L 2025-05-07T20:27:14.6398308Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6398636Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:27:14.6398909Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:27:14.6399178Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:27:14.6399447Z #define __cpp_capture_star_this 201603L 2025-05-07T20:27:14.6399762Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:27:14.6400065Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:27:14.6400503Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:27:14.6400885Z #define NFDBITS __NFDBITS 2025-05-07T20:27:14.6401146Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:27:14.6401437Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:27:14.6401762Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:27:14.6402083Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:27:14.6402338Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:27:14.6402630Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:27:14.6402937Z #define STA_UNSYNC 0x0040 2025-05-07T20:27:14.6403246Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:14.6403661Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:27:14.6404027Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:27:14.6404318Z #define __cpp_if_constexpr 201606L 2025-05-07T20:27:14.6404629Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:27:14.6404961Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:27:14.6405286Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:27:14.6405616Z #define __daddr_t_defined 2025-05-07T20:27:14.6405872Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:27:14.6406266Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:27:14.6406576Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:27:14.6407091Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:27:14.6407578Z #define _ACRTIMP 2025-05-07T20:27:14.6407801Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:27:14.6408074Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:27:14.6408371Z #define _IOS_BIN 128 2025-05-07T20:27:14.6408721Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:27:14.6409130Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6409401Z #define UNDERFLOW 4 2025-05-07T20:27:14.6409625Z #define NAME_MAX 255 
2025-05-07T20:27:14.6409858Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:27:14.6410873Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:27:14.6411572Z 2025-05-07T20:27:14.6411677Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:27:14.6411961Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:27:14.6412251Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:27:14.6412632Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:27:14.6413026Z #define __ptr_t void * 2025-05-07T20:27:14.6413262Z #define M_E 2.7182818284590452354 2025-05-07T20:27:14.6413543Z #define cudaSurfaceType1D 0x01 2025-05-07T20:27:14.6413814Z #define __USE_ISOCXX11 1 2025-05-07T20:27:14.6414076Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:27:14.6414394Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:27:14.6414692Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:27:14.6414966Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:27:14.6415263Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:27:14.6415582Z #define cudaSurfaceType2D 0x02 2025-05-07T20:27:14.6415849Z #define __linux 1 2025-05-07T20:27:14.6416092Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:27:14.6416363Z #define cudaDeviceMask 0xff 2025-05-07T20:27:14.6416636Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:27:14.6416934Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:27:14.6417210Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:27:14.6417505Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:27:14.6417820Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:27:14.6418129Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:27:14.6418420Z #define _BITS_TYPES_H 1 2025-05-07T20:27:14.6418711Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:27:14.6419058Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:27:14.6419357Z #define cudaSurfaceType3D 0x03 2025-05-07T20:27:14.6419636Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:27:14.6420015Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:27:14.6420303Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:27:14.6421085Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:27:14.6421905Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:27:14.6471827Z #define __unix 1 2025-05-07T20:27:14.6472123Z #define MATH_ERRNO 1 2025-05-07T20:27:14.6472371Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:27:14.6472653Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:27:14.6472913Z #define __SM_100_RT_H__ 2025-05-07T20:27:14.6473174Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:27:14.6473469Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:27:14.6473757Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6474038Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:27:14.6474336Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:27:14.6474812Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:27:14.6475622Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:27:14.6475918Z #define CUDARTAPI_CDECL 2025-05-07T20:27:14.6476186Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:27:14.6476457Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:27:14.6476732Z #define __cpp_lib_void_t 
201411 2025-05-07T20:27:14.6476989Z #define _POSIX_AIO_MAX 1 2025-05-07T20:27:14.6477220Z #define __SIZE_T 2025-05-07T20:27:14.6477464Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:27:14.6477780Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:27:14.6478070Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:27:14.6478328Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:27:14.6478585Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:27:14.6478848Z #define _ATFILE_SOURCE 1 2025-05-07T20:27:14.6479230Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:27:14.6479663Z #define __WAIT_STATUS void * 2025-05-07T20:27:14.6479926Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:27:14.6480194Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:27:14.6480463Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:27:14.6480751Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:27:14.6481029Z #define __WINT_MIN__ 0U 2025-05-07T20:27:14.6481596Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:27:14.6482240Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:27:14.6482542Z #define WUNTRACED 2 2025-05-07T20:27:14.6482773Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:27:14.6483045Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:27:14.6483336Z #define NZERO 20 2025-05-07T20:27:14.6483564Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:27:14.6483864Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:27:14.6484187Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:27:14.6484477Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:27:14.6484728Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:27:14.6485016Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:27:14.6485290Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:27:14.6485561Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:27:14.6485837Z #define EXIT_FAILURE 1 2025-05-07T20:27:14.6486078Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:27:14.6486332Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:27:14.6486598Z #define _SIZE_T_DEFINED_ 2025-05-07T20:27:14.6486848Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:27:14.6487127Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:27:14.6487460Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:27:14.6487819Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:27:14.6488114Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:27:14.6488363Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:27:14.6488635Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:27:14.6489080Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:27:14.6489383Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:27:14.6489677Z #define SEEK_DATA 3 2025-05-07T20:27:14.6489908Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:27:14.6490198Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:27:14.6490617Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:27:14.6491004Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:27:14.6491254Z #define __INT64_C(c) c ## L 2025-05-07T20:27:14.6491517Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:27:14.6491849Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:27:14.6492169Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:27:14.6492440Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:27:14.6492733Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:27:14.6493030Z #define 
STA_PPSWANDER 0x0400 2025-05-07T20:27:14.6493280Z #define __INT_WCHAR_T_H 2025-05-07T20:27:14.6493519Z #define WSTOPPED 2 2025-05-07T20:27:14.6493778Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:27:14.6494093Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:27:14.6494428Z #define FP_NORMAL 4 2025-05-07T20:27:14.6494670Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:27:14.6494949Z #define _BITS_TIMEX_H 1 2025-05-07T20:27:14.6495186Z #define _POSIX_LINK_MAX 8 2025-05-07T20:27:14.6495443Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:27:14.6495720Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:27:14.6495992Z #define cudaTextureType1D 0x01 2025-05-07T20:27:14.6496269Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:27:14.6496535Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:27:14.6496798Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:27:14.6497094Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:27:14.6497520Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:27:14.6497964Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:27:14.6498226Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:27:14.6498493Z #define _POSIX_SOURCE 1 2025-05-07T20:27:14.6498736Z #define cudaTextureType2D 0x02 2025-05-07T20:27:14.6498996Z #define _PTR_TRAITS_H 1 2025-05-07T20:27:14.6499268Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:27:14.6499576Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:27:14.6499842Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:27:14.6500163Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:27:14.6500493Z #define cudaTextureType3D 0x03 2025-05-07T20:27:14.6500756Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:27:14.6501014Z #define CLOCK_REALTIME 0 2025-05-07T20:27:14.6501260Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:27:14.6501527Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:27:14.6501831Z #define __cpp_aligned_new 201606L 2025-05-07T20:27:14.6502108Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:27:14.6502385Z #define cudaEventBlockingSync 0x01 2025-05-07T20:27:14.6502671Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:27:14.6502948Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:27:14.6503251Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:27:14.6503549Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:27:14.6503835Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:27:14.6504084Z #define __GLIBC__ 2 2025-05-07T20:27:14.6504307Z #define __END_DECLS } 2025-05-07T20:27:14.6504548Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:27:14.6504907Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:27:14.6505280Z #define __CONCAT(x,y) x ## y 2025-05-07T20:27:14.6505531Z #define WCONTINUED 8 2025-05-07T20:27:14.6505764Z #define __STDC_HOSTED__ 1 2025-05-07T20:27:14.6506014Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:27:14.6506284Z #define _ALLOCA_H 1 2025-05-07T20:27:14.6506520Z #define __host__ __location__(host) 2025-05-07T20:27:14.6506941Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:27:14.6507382Z #define __SLONG32_TYPE int 2025-05-07T20:27:14.6507805Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:27:14.6508090Z #define _SYS_SELECT_H 1 2025-05-07T20:27:14.6508338Z #define _IO_LINE_BUF 0x200 2025-05-07T20:27:14.6508593Z #define _IOS_NOCREATE 32 2025-05-07T20:27:14.6508838Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:27:14.6509113Z #define __cudaGet_warpSize() warpSize 
2025-05-07T20:27:14.6509411Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:27:14.6509703Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:27:14.6509987Z #define __global__ __location__(global) 2025-05-07T20:27:14.6510274Z #define __GNU_LIBRARY__ 6 2025-05-07T20:27:14.6510533Z #define __cpp_decltype_auto 201304L 2025-05-07T20:27:14.6510805Z #define __DBL_DIG__ 15 2025-05-07T20:27:14.6511031Z #define TIME_UTC 1 2025-05-07T20:27:14.6511249Z #define __FLT32_DIG__ 6 2025-05-07T20:27:14.6511563Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:27:14.6511956Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:27:14.6512280Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:27:14.6512581Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:27:14.6512884Z #define _G_BUFSIZ 8192 2025-05-07T20:27:14.6513273Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:27:14.6513640Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:27:14.6513983Z #define __cudaCDP2GetDevice 2025-05-07T20:27:14.6514265Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:27:14.6514551Z #define STA_CLOCKERR 0x1000 2025-05-07T20:27:14.6514794Z #define __GXX_WEAK__ 1 2025-05-07T20:27:14.6515050Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6515353Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:27:14.6515656Z #define __SHRT_WIDTH__ 16 2025-05-07T20:27:14.6515951Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:27:14.6516291Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:27:14.6516564Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:27:14.6516847Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:27:14.6517148Z #define _G_config_h 1 2025-05-07T20:27:14.6517417Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:27:14.6517759Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:27:14.6518037Z #define _GCC_WCHAR_T 2025-05-07T20:27:14.6518268Z #define TMP_MAX 238328 2025-05-07T20:27:14.6518505Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:27:14.6518767Z #define __DEVICE_TYPES_H__ 2025-05-07T20:27:14.6519025Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6519295Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:27:14.6519567Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:27:14.6519852Z #define _IO_SKIPWS 01 2025-05-07T20:27:14.6520246Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:27:14.6520702Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:27:14.6520965Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:27:14.6521289Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:27:14.6521658Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:27:14.6522024Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:27:14.6522379Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:27:14.6522638Z #define le32toh(x) (x) 2025-05-07T20:27:14.6522875Z #define _SIZE_T_DEFINED 2025-05-07T20:27:14.6523127Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:27:14.6523458Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:27:14.6523811Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:27:14.6524256Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:27:14.6524664Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:27:14.6524930Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:27:14.6525197Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:27:14.6525460Z #define _POSIX_NAME_MAX 14 
2025-05-07T20:27:14.6525737Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:27:14.6526347Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:27:14.6526849Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:27:14.6527153Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:27:14.6527504Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:27:14.6527821Z #define _WCHAR_T_ 2025-05-07T20:27:14.6528046Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:27:14.6528407Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:27:14.6528796Z #define RTSIG_MAX 32 2025-05-07T20:27:14.6529017Z #define _STDDEF_H 2025-05-07T20:27:14.6529247Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:27:14.6529516Z #define _VA_LIST_DEFINED 2025-05-07T20:27:14.6529764Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:27:14.6530097Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:27:14.6530488Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:27:14.6530810Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:27:14.6531110Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:27:14.6531570Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:27:14.6532176Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:27:14.6532539Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:27:14.6532858Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:27:14.6533172Z #define __unix__ 1 2025-05-07T20:27:14.6533402Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6533684Z #define __INT_WIDTH__ 32 2025-05-07T20:27:14.6533971Z #define __SIZEOF_LONG__ 8 2025-05-07T20:27:14.6534212Z #define _IONBF 2 2025-05-07T20:27:14.6534661Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:27:14.6535423Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:27:14.6535958Z #define __STDC_IEC_559__ 1 2025-05-07T20:27:14.6536213Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:27:14.6536479Z #define __UINT16_C(c) c 2025-05-07T20:27:14.6536722Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:27:14.6536997Z #define STA_DEL 0x0020 2025-05-07T20:27:14.6537242Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:27:14.6537498Z #define __id_t_defined 2025-05-07T20:27:14.6537766Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:27:14.6538216Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:27:14.6538643Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:27:14.6538913Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:27:14.6539168Z #define __DECIMAL_DIG__ 21 2025-05-07T20:27:14.6539423Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:27:14.6539693Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:27:14.6539952Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:27:14.6540217Z #define SING 2 2025-05-07T20:27:14.6540435Z #define STA_FREQHOLD 0x0080 2025-05-07T20:27:14.6540702Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6541005Z #define cudaStreamDefault 0x00 2025-05-07T20:27:14.6541356Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:27:14.6541727Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:27:14.6541995Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:27:14.6542266Z #define __gnu_linux__ 1 2025-05-07T20:27:14.6542503Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:27:14.6542753Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:27:14.6543042Z #define MAX_INPUT 255 2025-05-07T20:27:14.6543287Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:27:14.6543608Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:27:14.6543978Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:27:14.6544293Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:27:14.6544558Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:27:14.6544954Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:27:14.6545381Z #define _IO_SHOWPOS 02000 2025-05-07T20:27:14.6545817Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:27:14.6546184Z #define _Mfloat_ float 2025-05-07T20:27:14.6546455Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:27:14.6546767Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:27:14.6547049Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:27:14.6547371Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:27:14.6547913Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:27:14.6548405Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6548685Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:27:14.6549014Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:27:14.6549365Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:27:14.6549661Z #define __USE_ISOC11 1 2025-05-07T20:27:14.6549890Z #define _BSD_SIZE_T_ 2025-05-07T20:27:14.6550119Z #define ADJ_MICRO 0x1000 2025-05-07T20:27:14.6550377Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:27:14.6550644Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:27:14.6550946Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:27:14.6551345Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:27:14.6551652Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:27:14.6551978Z #define __THROW throw () 2025-05-07T20:27:14.6552230Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:27:14.6552521Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6552872Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:27:14.6553222Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:27:14.6553497Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:27:14.6553755Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:27:14.6554047Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:27:14.6554331Z #define L_tmpnam 20 2025-05-07T20:27:14.6554554Z #define ___int_wchar_t_h 2025-05-07T20:27:14.6554905Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:27:14.6555283Z #define isascii(c) __isascii (c) 2025-05-07T20:27:14.6555603Z #define _T_PTRDIFF 2025-05-07T20:27:14.6555911Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:27:14.6556266Z #define toascii(c) __toascii (c) 2025-05-07T20:27:14.6556525Z #define __GNUC__ 11 2025-05-07T20:27:14.6556778Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:27:14.6557074Z #define __GXX_RTTI 1 2025-05-07T20:27:14.6557297Z #define __pie__ 2 2025-05-07T20:27:14.6557510Z #define __MMX__ 1 2025-05-07T20:27:14.6557728Z #define __cudaCDP2Malloc 2025-05-07T20:27:14.6557984Z #define __timespec_defined 1 2025-05-07T20:27:14.6558234Z #define L_ctermid 9 2025-05-07T20:27:14.6558462Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:14.6558767Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:27:14.6559153Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:27:14.6559521Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:27:14.6559794Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:27:14.6560083Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:27:14.6560386Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:27:14.6560699Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:27:14.6560963Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:27:14.6561397Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:27:14.6562137Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:14.6562737Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:27:14.6563044Z #define __USE_SVID 1 2025-05-07T20:27:14.6563291Z #define __constant__ __location__(constant) 2025-05-07T20:27:14.6563606Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:27:14.6563906Z #define __device__ __location__(device) 2025-05-07T20:27:14.6564231Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:27:14.6564640Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:27:14.6564910Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:27:14.6565195Z #define CUDART_DEVICE __device__ 2025-05-07T20:27:14.6565835Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:27:14.6566227Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:27:14.6566511Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:27:14.6566869Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:27:14.6567246Z #define __STDC_UTF_16__ 1 2025-05-07T20:27:14.6567494Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:27:14.6567853Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:27:14.6568276Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:27:14.6568590Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:27:14.6568861Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:27:14.6569123Z #define NGROUPS_MAX 65536 2025-05-07T20:27:14.6569376Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:27:14.6569645Z #define __USE_ISOC95 1 2025-05-07T20:27:14.6569868Z #define _TIME_H 1 2025-05-07T20:27:14.6570132Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:27:14.6570600Z #define __USE_ISOC99 1 2025-05-07T20:27:14.6570917Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:27:14.6571286Z #define HOST_NAME_MAX 64 2025-05-07T20:27:14.6571533Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:27:14.6571784Z #define _IOS_ATEND 4 2025-05-07T20:27:14.6572017Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:27:14.6572339Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:27:14.6572738Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:14.6573073Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:27:14.6573355Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:27:14.6573673Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:27:14.6573979Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:27:14.6574241Z #define _STDIO_H 1 2025-05-07T20:27:14.6574631Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:27:14.6575098Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:27:14.6575457Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:14.6575832Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:27:14.6576118Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:27:14.6576392Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:27:14.6576663Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:27:14.6576955Z #define __cpp_raw_strings 200710L 2025-05-07T20:27:14.6577251Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6577566Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:27:14.6577838Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:27:14.6578116Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:27:14.6578420Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:27:14.6578696Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:27:14.6578984Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:27:14.6579337Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:27:14.6579710Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:27:14.6579950Z #define __USE_XOPEN 1 2025-05-07T20:27:14.6580194Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:27:14.6580631Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:14.6581073Z #define __USE_XOPEN2K 1 2025-05-07T20:27:14.6581314Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:27:14.6581586Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:27:14.6581886Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:27:14.6582153Z #define __cpp_fold_expressions 201603L 2025-05-07T20:27:14.6582667Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:27:14.6583191Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:27:14.6583470Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:27:14.6583995Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:27:14.6584394Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6584768Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:27:14.6585161Z #define __END_NAMESPACE_C99 2025-05-07T20:27:14.6585435Z #define __glibcxx_integral_traps true 2025-05-07T20:27:14.6585725Z #define _POSIX_PATH_MAX 256 2025-05-07T20:27:14.6585980Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:27:14.6586236Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:27:14.6586503Z #define _IOS_TRUNC 16 2025-05-07T20:27:14.6586732Z #define _ISOC11_SOURCE 1 2025-05-07T20:27:14.6586985Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:27:14.6587278Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:27:14.6587573Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:27:14.6587934Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:27:14.6588314Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:27:14.6588592Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:27:14.6588855Z #define _IO_UNITBUF 020000 2025-05-07T20:27:14.6589111Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:27:14.6589457Z #define __FD_SETSIZE 1024 2025-05-07T20:27:14.6597012Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:27:14.6597331Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:27:14.6597681Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:27:14.6598043Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:27:14.6598306Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:27:14.6598621Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:27:14.6598944Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:27:14.6599214Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:27:14.6599520Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:27:14.6599852Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:27:14.6600139Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:27:14.6600473Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:27:14.6600765Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:27:14.6601037Z #define __USE_POSIX199506 1 2025-05-07T20:27:14.6601282Z #define _FEATURES_H 1 2025-05-07T20:27:14.6601521Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:27:14.6601905Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:27:14.6602378Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:27:14.6602704Z #define 
__stub_getmsg 2025-05-07T20:27:14.6602930Z #define _IO_FIXED 010000 2025-05-07T20:27:14.6603198Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:27:14.6603509Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:27:14.6603785Z #define __stub_setlogin 2025-05-07T20:27:14.6604059Z #define __stub_fattach 2025-05-07T20:27:14.6604297Z #define __cplusplus 201703L 2025-05-07T20:27:14.6604558Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:27:14.6604839Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:27:14.6605092Z #define INFINITY (__builtin_inff()) 2025-05-07T20:27:14.6605364Z #define _IO_UNBUFFERED 2 2025-05-07T20:27:14.6605842Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:27:14.6606364Z #define _IO_INTERNAL 010 2025-05-07T20:27:14.6606609Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:27:14.6606934Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:14.6607283Z #define __dev_t_defined 2025-05-07T20:27:14.6607522Z #define __DEPRECATED 1 2025-05-07T20:27:14.6607750Z #define __S32_TYPE int 2025-05-07T20:27:14.6607998Z #define __cpp_rvalue_references 200610L 2025-05-07T20:27:14.6608290Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:27:14.6608542Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:27:14.6608795Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:27:14.6609389Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:27:14.6610009Z #define _G_HAVE_MREMAP 1 2025-05-07T20:27:14.6610460Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:14.6610804Z #define OVERFLOW 3 2025-05-07T20:27:14.6611050Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:27:14.6611357Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:27:14.6611642Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6611972Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:27:14.6612295Z #define __SSE2_MATH__ 1 2025-05-07T20:27:14.6612535Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:27:14.6612839Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6613133Z #define _IO_STDIO_H 2025-05-07T20:27:14.6613372Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:27:14.6613662Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:27:14.6614001Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:27:14.6614317Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6614625Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:27:14.6614887Z #define __amd64 1 2025-05-07T20:27:14.6615104Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:27:14.6615374Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:27:14.6615652Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:27:14.6616022Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:27:14.6616326Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:27:14.6616589Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:27:14.6616881Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:27:14.6617142Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:27:14.6617389Z #define __bounded 2025-05-07T20:27:14.6617606Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:14.6617871Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6618156Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:27:14.6618429Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:27:14.6618691Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:27:14.6618962Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6619273Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:27:14.6619686Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:14.6620082Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:27:14.6620347Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:27:14.6620685Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:27:14.6621026Z #define STA_PLL 0x0001 2025-05-07T20:27:14.6621268Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:27:14.6621535Z #define __GNUG__ 11 2025-05-07T20:27:14.6621770Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:27:14.6622027Z #define _T_WCHAR 2025-05-07T20:27:14.6622256Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:27:14.6622545Z #define __specialization_static 2025-05-07T20:27:14.6622841Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:27:14.6623146Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:27:14.6623403Z #define cudaArraySparse 0x40 2025-05-07T20:27:14.6623664Z #define STA_PPSFREQ 0x0002 2025-05-07T20:27:14.6623937Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:27:14.6624230Z #define _WCHAR_T 2025-05-07T20:27:14.6624454Z #define __cudaCDP2Free 2025-05-07T20:27:14.6625092Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:27:14.6625773Z #define __cpp_nsdmi 200809L 2025-05-07T20:27:14.6626179Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:27:14.6626614Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:27:14.6626883Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:27:14.6627146Z #define cudaArrayCubemap 0x04 2025-05-07T20:27:14.6627469Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:14.6627821Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:27:14.6628054Z #define __NO_CTYPE 1 2025-05-07T20:27:14.6628280Z #define __stub_bdflush 2025-05-07T20:27:14.6628629Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:27:14.6629129Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:27:14.6629429Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:27:14.6629695Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:27:14.6629966Z #define __cpp_initializer_lists 200806L 2025-05-07T20:27:14.6630269Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:27:14.6630565Z #define __U16_TYPE unsigned short int 2025-05-07T20:27:14.6630888Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:27:14.6631231Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:27:14.6631507Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:27:14.6631785Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:27:14.6632116Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:27:14.6632454Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:27:14.6632731Z #define _IO_STDIO 040000 2025-05-07T20:27:14.6633042Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:27:14.6633421Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:27:14.6633741Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:27:14.6634062Z #define _PTRDIFF_T 2025-05-07T20:27:14.6634282Z #define _MOVE_H 1 2025-05-07T20:27:14.6634587Z #define __cpp_hex_float 201603L 2025-05-07T20:27:14.6634838Z #define ADJ_TAI 0x0080 2025-05-07T20:27:14.6635062Z #define __ptrvalue 2025-05-07T20:27:14.6635277Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:27:14.6635577Z 
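The fd_set helpers dumped above (FD_ZERO/__FD_ZERO, __FD_MASK, __FD_ELT, __NFDBITS) implement select(2)'s fixed-size descriptor bitmask. A minimal C++ sketch of their intended use, assuming a POSIX host like this runner; everything beyond the macros themselves is illustrative:

    #include <sys/select.h>
    #include <cstdio>

    int main() {
        fd_set readfds;
        FD_ZERO(&readfds);           // expands to the inline "stosq" loop dumped above
        FD_SET(0, &readfds);         // sets bit (0 % __NFDBITS) of word __FD_ELT(0)
        struct timeval tv = {5, 0};  // 5-second timeout
        int ready = select(1, &readfds, nullptr, nullptr, &tv);
        if (ready > 0 && FD_ISSET(0, &readfds))
            std::puts("stdin is readable");
        return 0;
    }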
#define __GXX_ABI_VERSION 1016 2025-05-07T20:27:14.6635853Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:27:14.6636148Z #define MATH_ERREXCEPT 2 2025-05-07T20:27:14.6636394Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:27:14.6636668Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:27:14.6637055Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:27:14.6637430Z #define __USE_GNU 1 2025-05-07T20:27:14.6637655Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:27:14.6637924Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:27:14.6638190Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:27:14.6638571Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:27:14.6638966Z #define WEXITED 4 2025-05-07T20:27:14.6639180Z #define _IO_NO_READS 4 2025-05-07T20:27:14.6639472Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:27:14.6639817Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:27:14.6640094Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:27:14.6640390Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:27:14.6640696Z #define __uid_t_defined 2025-05-07T20:27:14.6640945Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:27:14.6641231Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:27:14.6641496Z #define WNOHANG 1 2025-05-07T20:27:14.6641739Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:27:14.6642043Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:27:14.6642311Z #define cudaEventDefault 0x00 2025-05-07T20:27:14.6642606Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:27:14.6642924Z #define NL_SETMAX INT_MAX 2025-05-07T20:27:14.6643151Z #define __x86_64 1 2025-05-07T20:27:14.6643379Z #define __cudaCDP2LaunchDevice 2025-05-07T20:27:14.6643778Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:14.6644287Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:27:14.6644785Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:14.6645212Z #define __PTRDIFF_T 2025-05-07T20:27:14.6645532Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:27:14.6645900Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:27:14.6646170Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6646456Z #define _Mlong_double_ long double 2025-05-07T20:27:14.6646732Z #define __cpp_lambdas 200907L 2025-05-07T20:27:14.6646979Z #define _IO_DEC 020 2025-05-07T20:27:14.6647203Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:27:14.6647551Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:27:14.6647840Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:27:14.6648118Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:27:14.6648374Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:27:14.6648669Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:27:14.6648987Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:27:14.6649252Z #define _ANSI_STDDEF_H 2025-05-07T20:27:14.6649510Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:27:14.6649818Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:27:14.6650177Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:27:14.6650551Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:27:14.6650828Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:27:14.6651117Z #define __cpp_template_auto 201606L 2025-05-07T20:27:14.6651466Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:27:14.6651832Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:27:14.6652105Z #define __key_t_defined 2025-05-07T20:27:14.6652349Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:27:14.6652711Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:27:14.6653258Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:27:14.6653620Z #define __GNUC_VA_LIST 2025-05-07T20:27:14.6653947Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:14.6654325Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:27:14.6654586Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:27:14.6654856Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:27:14.6655144Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:27:14.6655389Z #define __WCOREFLAG 0x80 2025-05-07T20:27:14.6655633Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:27:14.6655939Z #define cudaEventDisableTiming 0x02 2025-05-07T20:27:14.6656215Z #define __LP64__ 1 2025-05-07T20:27:14.6656456Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:27:14.6656776Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:27:14.6657055Z #define _IO_off64_t __off64_t 2025-05-07T20:27:14.6657309Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6657574Z #define __time_t_defined 1 2025-05-07T20:27:14.6657821Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:27:14.6658160Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:27:14.6658522Z #define __USE_UNIX98 1 2025-05-07T20:27:14.6658768Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6659029Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:27:14.6659296Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:27:14.6659592Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:27:14.6659898Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:27:14.6660154Z #define SEEK_CUR 1 2025-05-07T20:27:14.6660380Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6660647Z #define _ASSERT_H 1 2025-05-07T20:27:14.6661211Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:27:14.6661832Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:27:14.6662110Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:27:14.6662354Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:27:14.6662623Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:27:14.6662892Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:27:14.6663256Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:14.6663659Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:27:14.6664362Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:27:14.6665006Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:27:14.6665293Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:27:14.6665956Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:27:14.6667778Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:27:14.6668055Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:27:14.6668333Z #define cudaArrayDefault 0x00 2025-05-07T20:27:14.6668615Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:27:14.6668900Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:27:14.6669174Z #define TLOSS 5 2025-05-07T20:27:14.6669387Z #define __ssize_t_defined 2025-05-07T20:27:14.6669630Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:27:14.6669897Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:27:14.6670183Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:27:14.6670460Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:27:14.6670733Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:27:14.6671014Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:27:14.6671318Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:27:14.6671604Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:27:14.6671887Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:27:14.6672164Z #define __REGISTER_PREFIX__ 2025-05-07T20:27:14.6672422Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:27:14.6672748Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:27:14.6673235Z #define _IOS_NOREPLACE 64 2025-05-07T20:27:14.6673466Z #define __cdecl 2025-05-07T20:27:14.6673700Z #define cudaEventInterprocess 0x04 2025-05-07T20:27:14.6674021Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:27:14.6674342Z #define LOGIN_NAME_MAX 256 2025-05-07T20:27:14.6674586Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:27:14.6674849Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:27:14.6675141Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:27:14.6675457Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:27:14.6675761Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:27:14.6676086Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:27:14.6676481Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:14.6676906Z #define ADJ_NANO 0x2000 2025-05-07T20:27:14.6677209Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:27:14.6677559Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:27:14.6677848Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:27:14.6678107Z #define __FLT_DIG__ 6 2025-05-07T20:27:14.6678447Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:27:14.6678834Z #define __NO_INLINE__ 1 2025-05-07T20:27:14.6679132Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:14.6679475Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:27:14.6679725Z #define ADJ_STATUS 0x0010 2025-05-07T20:27:14.6679983Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:27:14.6680266Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:27:14.6680527Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:14.6680820Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:27:14.6681103Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:27:14.6681477Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:27:14.6681889Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:27:14.6682229Z 
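The __isleap macro from <time.h>, dumped a few entries back, encodes the Gregorian leap-year rule: divisible by 4, excluding centuries unless divisible by 400. An equivalent constexpr restatement, checked at compile time (the function name is illustrative, not glibc's):

    constexpr bool is_leap(int year) {
        // same expression as __isleap(year)
        return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    }

    static_assert(is_leap(2000), "divisible by 400");
    static_assert(!is_leap(1900), "century not divisible by 400");
    static_assert(is_leap(2024), "divisible by 4");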
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:27:14.6682578Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:27:14.6682814Z #define MAX_CANON 255 2025-05-07T20:27:14.6683043Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:27:14.6683294Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:27:14.6683555Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:27:14.6683856Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:27:14.6684189Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:27:14.6684478Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:27:14.6684749Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:27:14.6685067Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:27:14.6685371Z #define __VERSION__ "11.4.0" 2025-05-07T20:27:14.6685626Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:27:14.6685914Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:27:14.6686305Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:27:14.6686582Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:27:14.6686891Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:27:14.6687190Z #define __UINT64_C(c) c ## UL 2025-05-07T20:27:14.6687441Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:27:14.6687688Z #define _SYS_TYPES_H 1 2025-05-07T20:27:14.6687918Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:27:14.6688168Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:27:14.6688414Z #define _SYS_CDEFS_H 1 2025-05-07T20:27:14.6688645Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:27:14.6688910Z #define __cpp_unicode_characters 201411L 2025-05-07T20:27:14.6689197Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:27:14.6689449Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:27:14.6689733Z #define __cudaCDP2StreamDestroy 2025-05-07T20:27:14.6689996Z #define FP_SUBNORMAL 3 2025-05-07T20:27:14.6690242Z #define cudaOccupancyDefault 0x00 2025-05-07T20:27:14.6690513Z #define _INITIALIZER_LIST 2025-05-07T20:27:14.6690763Z #define _STDC_PREDEF_H 1 2025-05-07T20:27:14.6691015Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:27:14.6691296Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:27:14.6691628Z #define _IO_file_flags _flags 2025-05-07T20:27:14.6691883Z #define __USE_XOPEN2K8 1 2025-05-07T20:27:14.6692126Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:27:14.6692396Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:27:14.6692669Z #define HUGE 3.40282347e+38F 2025-05-07T20:27:14.6692929Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:27:14.6693296Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:27:14.6693682Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:27:14.6693987Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:27:14.6694249Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:27:14.6694501Z #define _BSD_SOURCE 1 2025-05-07T20:27:14.6694737Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:27:14.6695563Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:27:14.6696393Z #define __catch(X) catch(X) 2025-05-07T20:27:14.6696648Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:27:14.6696934Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:27:14.6697202Z #define __TIMER_T_TYPE void * 2025-05-07T20:27:14.6697447Z #define __STRING(x) #x 2025-05-07T20:27:14.6697681Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:27:14.6697945Z #define _T_PTRDIFF_ 2025-05-07T20:27:14.6698188Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:27:14.6698487Z
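The _GLIBCXX_HAS_NESTED_TYPE macro just above is libstdc++'s member-type detection idiom: a primary template defaulting to false_type, plus a partial specialization selected via __void_t when the nested type exists. A standalone sketch of the same idiom using std::void_t (illustrative names, not libstdc++ internals):

    #include <type_traits>

    template<typename, typename = void>
    struct has_value_type : std::false_type { };        // primary: no nested type

    template<typename T>                                 // chosen when T::value_type is well-formed
    struct has_value_type<T, std::void_t<typename T::value_type>> : std::true_type { };

    struct A { using value_type = int; };
    struct B { };

    static_assert(has_value_type<A>::value, "A exposes value_type");
    static_assert(!has_value_type<B>::value, "B does not");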
#define cudaEventWaitExternal 0x01 2025-05-07T20:27:14.6698752Z #define __unbounded 2025-05-07T20:27:14.6698987Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6699267Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:27:14.6699537Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6699824Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:27:14.6700099Z #define __cpp_lib_is_final 201402L 2025-05-07T20:27:14.6700388Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:27:14.6700710Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:27:14.6701009Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:27:14.6701287Z #define __managed__ __location__(managed) 2025-05-07T20:27:14.6701578Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:27:14.6701976Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:14.6702391Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:27:14.6702640Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:27:14.6703004Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:27:14.6703401Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:27:14.6703646Z #define _SYS_SIZE_T_H 2025-05-07T20:27:14.6703953Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:27:14.6704312Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:27:14.6704666Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:27:14.6704953Z #define _CRTIMP 2025-05-07T20:27:14.6705175Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:27:14.6705477Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:14.6705795Z #define STA_PPSJITTER 0x0200 2025-05-07T20:27:14.6706139Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:27:14.6706541Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6706850Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:27:14.6707121Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:27:14.6707405Z #define __SIZE_T__ 2025-05-07T20:27:14.6707610Z #define __stub_gtty 2025-05-07T20:27:14.6707840Z #define __pid_t_defined 2025-05-07T20:27:14.6708093Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:27:14.6708383Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6708689Z #define __glibcxx_function_requires(...) 
2025-05-07T20:27:14.6708976Z #define __SM_80_RT_HPP__ 2025-05-07T20:27:14.6709218Z #define __need_clockid_t 2025-05-07T20:27:14.6709454Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:27:14.6709708Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:27:14.6710107Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:27:14.6710414Z #define _IO_HEX 0100 2025-05-07T20:27:14.6710670Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:27:14.6710997Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:27:14.6711095Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:27:14.6711195Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:27:14.6711415Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:14.6711531Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:27:14.6711640Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:27:14.6711742Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:27:14.6711847Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:27:14.6711952Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:27:14.6712038Z #define __stub_sstk 2025-05-07T20:27:14.6712130Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:27:14.6712286Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:27:14.6712373Z #define __wur 2025-05-07T20:27:14.6712489Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:27:14.6712581Z #define _G_HAVE_MMAP 1 2025-05-07T20:27:14.6712663Z #define _IO_OCT 040 2025-05-07T20:27:14.6712755Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:27:14.6712848Z #define NL_MSGMAX INT_MAX 2025-05-07T20:27:14.6712939Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:27:14.6713068Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:27:14.6713158Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:27:14.6713260Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:27:14.6713450Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:27:14.6713601Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:27:14.6713762Z #define _STL_ALGOBASE_H 1 2025-05-07T20:27:14.6713943Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:27:14.6718361Z #define __off64_t_defined 2025-05-07T20:27:14.6718484Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:27:14.6718587Z #define __FLT128_DIG__ 33 2025-05-07T20:27:14.6718697Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:27:14.6718794Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:27:14.6718883Z #define __INT32_C(c) c 2025-05-07T20:27:14.6718978Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:27:14.6719074Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:27:14.6719177Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:27:14.6719273Z #define __PDP_ENDIAN 3412 2025-05-07T20:27:14.6719360Z #define _ISOC95_SOURCE 1 2025-05-07T20:27:14.6719463Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:27:14.6719594Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:27:14.6719689Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:27:14.6719783Z #define __SM_90_RT_HPP__ 2025-05-07T20:27:14.6719880Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:27:14.6720083Z #define __have_pthread_attr_t 1 2025-05-07T20:27:14.6720186Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:27:14.6720412Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:27:14.6720528Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:27:14.6720636Z #define __cudaCDP2EventRecord 2025-05-07T20:27:14.6720732Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:27:14.6720816Z #define 
htole32(x) (x) 2025-05-07T20:27:14.6721068Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:27:14.6721185Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:27:14.6721286Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:27:14.6721443Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:27:14.6721580Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:27:14.6721701Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:27:14.6721838Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:27:14.6721936Z #define ADJ_OFFSET 0x0001 2025-05-07T20:27:14.6722040Z #define cudaArrayLayered 0x01 2025-05-07T20:27:14.6722203Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:27:14.6722391Z #define cudaEventRecordDefault 0x00 2025-05-07T20:27:14.6722491Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:27:14.6722589Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:27:14.6722669Z #define unix 1 2025-05-07T20:27:14.6722765Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:27:14.6722855Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:27:14.6722948Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:27:14.6723067Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:27:14.6723151Z #define __USE_POSIX 1 2025-05-07T20:27:14.6723243Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:27:14.6723375Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:27:14.6723465Z #define __THROWNL throw () 2025-05-07T20:27:14.6723567Z #define __cpp_rtti 199711L 2025-05-07T20:27:14.6723674Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:27:14.6723762Z #define __PMT(args) args 2025-05-07T20:27:14.6723876Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6724027Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:27:14.6724136Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:27:14.6724228Z #define _SIZE_T_DECLARED 2025-05-07T20:27:14.6724325Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:27:14.6724416Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:27:14.6724803Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:27:14.6724901Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:27:14.6724995Z #define XATTR_LIST_MAX 65536 2025-05-07T20:27:14.6725088Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:27:14.6725227Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:27:14.6725313Z #define _WCHAR_T_H 2025-05-07T20:27:14.6725401Z #define __FLT64X_DIG__ 18 2025-05-07T20:27:14.6725494Z #define _IO_SHOWBASE 0200 2025-05-07T20:27:14.6725587Z #define _POSIX_QLIMIT 1 2025-05-07T20:27:14.6725685Z #define __INT8_TYPE__ signed char 2025-05-07T20:27:14.6725789Z #define __SURFACE_TYPES_H__ 2025-05-07T20:27:14.6725879Z #define __CUDA_ARCH__ 520 2025-05-07T20:27:14.6725984Z #define __cpp_digit_separators 201309L 2025-05-07T20:27:14.6726066Z #define __ELF__ 1 2025-05-07T20:27:14.6726170Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:27:14.6726267Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:27:14.6726356Z #define STA_INS 0x0010 2025-05-07T20:27:14.6726453Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:27:14.6726619Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:27:14.6726716Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:27:14.6726811Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:27:14.6726920Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:27:14.6727029Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6727210Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:14.6727313Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:14.6727415Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:14.6727570Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:14.6727724Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:14.6727822Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:14.6728140Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:14.6728270Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:14.6728365Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:14.6728451Z #define __FLT_RADIX__ 2 2025-05-07T20:27:14.6728555Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:14.6728715Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:14.6728809Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:14.6728907Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:14.6729013Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:14.6729109Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:14.6729205Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:14.6729386Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:14.6729472Z #define WORD_BIT 32 2025-05-07T20:27:14.6729556Z #define _IO_USER_BUF 1 2025-05-07T20:27:14.6729647Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:14.6729751Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6729859Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:14.6729957Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:14.6730058Z #define __long_double_t long double 2025-05-07T20:27:14.6730152Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:14.6730242Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:14.6730636Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:14.6730719Z #define __k8 1 2025-05-07T20:27:14.6730919Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:14.6731084Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:14.6731200Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:14.6731308Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:14.6731403Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:14.6731502Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:14.6731598Z #define __blksize_t_defined 2025-05-07T20:27:14.6731691Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:14.6731787Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:14.6731900Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:14.6731993Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:14.6732102Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:14.6732195Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:14.6732290Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:14.6732544Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:14.6732883Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:14.6732984Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:14.6733087Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:14.6733170Z #define SEEK_SET 0 2025-05-07T20:27:14.6733267Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:14.6733363Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:27:14.6733548Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:14.6733651Z #define __cudaCDP2GetLastError 2025-05-07T20:27:14.6733745Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:14.6733833Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:14.6734173Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:14.6734290Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:14.6734387Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:14.6734478Z #define __stub_sigreturn 2025-05-07T20:27:14.6734789Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:14.6734886Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:14.6734983Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:14.6735082Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:14.6735168Z #define CLOCK_TAI 11 2025-05-07T20:27:14.6735271Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:14.6735474Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:14.6735565Z #define __restrict_arr 2025-05-07T20:27:14.6735674Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:14.6735811Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:14.6736326Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:14.6736511Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:14.6736601Z #define __USE_MISC 1 2025-05-07T20:27:14.6736707Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:14.6736883Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:14.6736973Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:14.6737059Z #define __LDBL_DIG__ 18 2025-05-07T20:27:14.6737154Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:14.6737259Z #define __malloc_and_calloc_defined 2025-05-07T20:27:14.6737350Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:14.6737452Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:14.6737538Z #define __x86_64__ 1 2025-05-07T20:27:14.6737618Z #define _SIZE_T_ 2025-05-07T20:27:14.6738493Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:14.6738593Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:14.6738688Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:14.6738810Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:14.6738923Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:14.6739016Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:14.6739124Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:14.6739243Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:14.6739380Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:14.6739476Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:14.6739927Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:27:14.6740055Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:14.6740201Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:14.6740300Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:14.6740396Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:14.6740486Z #define STA_FLL 0x0008 2025-05-07T20:27:14.6740624Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:14.6740722Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:14.6740841Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6740950Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:14.6741036Z #define __stub_revoke 2025-05-07T20:27:14.6741128Z #define __timer_t_defined 1 2025-05-07T20:27:14.6741261Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:14.6741351Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:14.6741453Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:14.6741561Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:14.6741656Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:14.6741756Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:14.6741968Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:14.6742068Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:14.6742212Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:14.6742312Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:14.6742400Z #define _IO_off_t __off_t 2025-05-07T20:27:14.6742490Z #define __FLT64_DIG__ 15 2025-05-07T20:27:14.6742704Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:14.6742799Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:14.6742929Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6743047Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:14.6743143Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:14.6743247Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:14.6743330Z #define NULL __null 2025-05-07T20:27:14.6743458Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:14.6743563Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:14.6743666Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:14.6743763Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6743860Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:14.6744048Z #define FP_ZERO 2 2025-05-07T20:27:14.6744167Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:14.6744315Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:14.6744421Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6744508Z #define __WCHAR_T__ 2025-05-07T20:27:14.6744600Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:14.6744792Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:14.6744942Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:14.6745037Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:14.6745156Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:14.6745269Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:14.6745394Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:14.6745524Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:14.6745617Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:14.6745707Z #define _SIGSET_H_types 1 2025-05-07T20:27:14.6745824Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:14.6745926Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:14.6746070Z 
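The <endian.h> macros scattered through this dump (htobe64, be32toh, htole32, and the __bswap_constant_* helpers) are identities or byte swaps depending on host byte order; on this little-endian x86_64 runner the le variants are identities. A hedged round-trip sketch, assuming glibc's htobe32/be32toh pair (htobe32 is not shown in this excerpt but is defined alongside htobe64):

    #include <endian.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint32_t wire = htobe32(0x01020304u);  // convert to big-endian wire order
        std::printf("%08x\n", be32toh(wire));       // round-trips: prints 01020304
        return 0;
    }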
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:14.6746173Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:14.6746289Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:14.6746419Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:14.6746524Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:14.6746648Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:14.6746761Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:27:14.6746930Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:14.6747025Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:14.6747137Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:14.6747236Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:14.6747324Z #define STA_MODE 0x4000 2025-05-07T20:27:14.6747439Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:14.6747539Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:14.6747654Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:14.6747756Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:14.6747852Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:14.6747960Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:14.6748055Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:14.6748164Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:14.6748255Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:14.6748370Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6748451Z #define __SEG_FS 1 2025-05-07T20:27:14.6748540Z #define _IO_size_t size_t 2025-05-07T20:27:14.6748637Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:14.6748812Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:14.6748903Z #define __stub_lchmod 2025-05-07T20:27:14.6748994Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:14.6749109Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6749206Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:14.6749287Z #define __SEG_GS 1 2025-05-07T20:27:14.6749468Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:14.6749555Z #define _IOS_APPEND 8 2025-05-07T20:27:14.6749648Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:14.6749742Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:14.6749838Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:14.6749933Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:14.6750037Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:14.6750126Z #define htole16(x) (x) 2025-05-07T20:27:14.6750231Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:14.6750329Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:14.6750422Z #define __INT16_TYPE__ short int 2025-05-07T20:27:14.6750530Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:14.6750635Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:14.6750826Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:14.6750952Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:14.6751039Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:14.6751128Z #define __WCLONE 0x80000000 2025-05-07T20:27:14.6751222Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:14.6751306Z #define SEEK_HOLE 4 2025-05-07T20:27:14.6751393Z #define TIMER_ABSTIME 1 2025-05-07T20:27:14.6751489Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:14.6751580Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:14.6751752Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:14.6751865Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6751959Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:27:14.6752072Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:14.6752171Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6752291Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:14.6752382Z #define _LINUX_LIMITS_H 2025-05-07T20:27:14.6752469Z #define linux 1 2025-05-07T20:27:14.6752561Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:14.6752673Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:14.6752769Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:14.6752861Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:14.6752970Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:14.6753116Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:14.6753214Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:14.6753310Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6753407Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:14.6753499Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:14.6753583Z #define htole64(x) (x) 2025-05-07T20:27:14.6753682Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:14.6753813Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:14.6753907Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:14.6754387Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:14.6754480Z #define __USE_POSIX2 1 2025-05-07T20:27:14.6754577Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:14.6754666Z #define __WALL 0x40000000 2025-05-07T20:27:14.6754766Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:14.6754848Z #define _XLOCALE_H 1 2025-05-07T20:27:14.6754947Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:14.6755045Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:14.6755140Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6755247Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:14.6755333Z #define __EXCEPTIONS 1 2025-05-07T20:27:14.6755490Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:14.6755681Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:14.6755768Z #define __WORDSIZE 64 2025-05-07T20:27:14.6755969Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:14.6756062Z #define _STL_RELOPS_H 1 2025-05-07T20:27:14.6756157Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:14.6756259Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:14.6756359Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:14.6756450Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:14.6756550Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:14.6756842Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:14.6757069Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:14.6757188Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:14.6757285Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:14.6757385Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:14.6757496Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:14.6757596Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:14.6757706Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:14.6757888Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:14.6757985Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:14.6758160Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:14.6758265Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:14.6758434Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:14.6758549Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:14.6758631Z #define _STRING_H 1 2025-05-07T20:27:14.6758731Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:14.6758823Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:14.6758920Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:14.6759051Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:14.6759148Z #define __code_model_small__ 1 2025-05-07T20:27:14.6759235Z #define _PSTL_CONFIG_H 2025-05-07T20:27:14.6759340Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:14.6759452Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:14.6759557Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:14.6759660Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:14.6759998Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:14.6760091Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:14.6760181Z #define le64toh(x) (x) 2025-05-07T20:27:14.6760272Z #define FILENAME_MAX 4096 2025-05-07T20:27:14.6760418Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:14.6760537Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:14.6760619Z #define L_cuserid 9 2025-05-07T20:27:14.6760711Z #define __ino_t_defined 2025-05-07T20:27:14.6760791Z #define __k8__ 1 2025-05-07T20:27:14.6760889Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:14.6760999Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:14.6761087Z #define __int8_t_defined 2025-05-07T20:27:14.6761180Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:14.6761289Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6761402Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:14.6761498Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:14.6761624Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:14.6761772Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:14.6761857Z #define __HAVE_COLUMN 2025-05-07T20:27:14.6761945Z #define __stub_fdetach 2025-05-07T20:27:14.6762344Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:14.6762429Z #define __pic__ 2 2025-05-07T20:27:14.6762544Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6762640Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:14.6762737Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:14.6762839Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:14.6762925Z #define __stub_chflags 2025-05-07T20:27:14.6763014Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:14.6763179Z #define __need_IOV_MAX 2025-05-07T20:27:14.6763286Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:14.6763395Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:14.6763491Z #define __cpp_decltype 200707L 2025-05-07T20:27:14.6763593Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:14.6763689Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:14.6763794Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:14.6763882Z #define TTY_NAME_MAX 32 2025-05-07T20:27:14.6764043Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:14.6764162Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6764329Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:14.6764439Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:14.6764531Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:14.6764627Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:14.6764711Z #define __import__ 2025-05-07T20:27:14.6764806Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:14.6764943Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:14.6765026Z #define __export__ 2025-05-07T20:27:14.6765228Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:14.6765326Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:14.6766422Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:14.6766524Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:14.6766614Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:14.6766716Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:14.6766806Z #define _WCHAR_T_DECLARED 2025-05-07T20:27:14.6766924Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:14.6767041Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:14.6767144Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:14.6767239Z #define WNOWAIT 0x01000000 2025-05-07T20:27:14.6767327Z #define PLOSS 6 2025-05-07T20:27:14.6767421Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:14.6767685Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:14.6767777Z #define EXIT_SUCCESS 0 2025-05-07T20:27:14.6767881Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:14.6767983Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:14.6768084Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:14.6768173Z #define __thread__ __thread 2025-05-07T20:27:14.6768274Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:14.6768366Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:14.6768471Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:14.6768696Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:14.6768809Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:14.6768902Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:14.6768987Z #define __linux__ 1 2025-05-07T20:27:14.6769082Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:14.6769214Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:14.6769309Z #define __S16_TYPE short int 2025-05-07T20:27:14.6769647Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:14.6769765Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:14.6769950Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:14.6770049Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:14.6770153Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:14.6770236Z #define _T_SIZE_ 2025-05-07T20:27:14.6770334Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:14.6770453Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:14.6770546Z #define _PSTL_VERSION 12000 2025-05-07T20:27:14.6770669Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:14.6770764Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:14.6770860Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:14.6770991Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:14.6771255Z #define _IOS_INPUT 1 2025-05-07T20:27:14.6771350Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:14.6771458Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:14.6771555Z #define __INT64_TYPE__ long int 2025-05-07T20:27:14.6771649Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:14.6771751Z #define __shared__ __location__(shared) 2025-05-07T20:27:14.6771842Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:14.6771995Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:14.6772087Z #define __gid_t_defined 2025-05-07T20:27:14.6772196Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:14.6772296Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:14.6772491Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:14.6772588Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:14.6772683Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:14.6772769Z #define ___int_size_t_h 2025-05-07T20:27:14.6772880Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6773005Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:27:14.6773157Z 
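The assert macro dumped just above evaluates its argument and, on failure, passes the stringized expression (via __STRING), __FILE__, __LINE__, and __ASSERT_FUNCTION to __assert_fail. A minimal usage sketch:

    #include <cassert>

    int main() {
        int denom = 2;
        assert(denom != 0 && "denominator must be nonzero");  // failure calls __assert_fail
        return 10 / denom == 5 ? 0 : 1;
    }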
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:14.6773376Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:14.6773474Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:14.6773570Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:14.6773666Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:14.6773810Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6773930Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:14.6774064Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:14.6774155Z #define __clock_t_defined 1 2025-05-07T20:27:14.6774253Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:14.6774363Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:14.6774453Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:14.6774544Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:14.6774644Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:14.6774757Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:14.6774846Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:14.6775024Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:14.6775105Z #define __SSE__ 1 2025-05-07T20:27:14.6775202Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:14.6775296Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:14.6775381Z #define _CTYPE_H 1 2025-05-07T20:27:14.6775479Z #define __sigset_t_defined 2025-05-07T20:27:14.6775574Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:14.6775667Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:14.6775756Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:14.6775851Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:14.6775944Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:14.6776033Z #define __SM_70_RT_H__ 2025-05-07T20:27:14.6776125Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:14.6776230Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:14.6776337Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:14.6776494Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:14.6776592Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:14.6776704Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:14.6776797Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:14.6776891Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:14.6776973Z #define __amd64__ 1 2025-05-07T20:27:14.6777061Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:14.6777168Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:14.6777433Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:14.6777532Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:14.6777617Z #define EOF (-1) 2025-05-07T20:27:14.6777712Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:14.6777809Z #define __USE_POSIX199309 1 2025-05-07T20:27:14.6777904Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:14.6777997Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:14.6778176Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:14.6778276Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:14.6778386Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:14.6778487Z #define ____mbstate_t_defined 1 2025-05-07T20:27:14.6778572Z #define STA_NANO 0x2000 2025-05-07T20:27:14.6778666Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:14.6778763Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:14.6778849Z #define _IO_LINKED 0x80 2025-05-07T20:27:14.6778946Z #define __cpp_lib_launder 201606 2025-05-07T20:27:14.6779040Z #define __SIZEOF_INT128__ 16 2025-05-07T20:27:14.6779144Z 
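The wait-status macros in this dump (WEXITSTATUS, WIFSTOPPED, WTERMSIG, __WIFCONTINUED) decode the packed status word from wait(2): per the __WEXITSTATUS and __WTERMSIG definitions above, the exit code lives in bits 8-15 and the terminating signal in bits 0-6. A short sketch:

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        pid_t pid = fork();
        if (pid == 0)
            _exit(7);                          // child exits with status 7
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))                 // i.e. __WTERMSIG(status) == 0
            std::printf("child exited with %d\n", WEXITSTATUS(status));  // prints 7
        return 0;
    }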
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:14.6779240Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:14.6779332Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:14.6779470Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:14.6779580Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6779681Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:14.6779781Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:14.6779877Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:14.6779966Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:14.6780172Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:14.6780295Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:14.6780496Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:14.6780679Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:14.6780768Z #define __stub_stty 2025-05-07T20:27:14.6780929Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:14.6781019Z #define le16toh(x) (x) 2025-05-07T20:27:14.6781125Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:14.6781295Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:14.6781379Z #define _SIZET_ 2025-05-07T20:27:14.6781469Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:14.6781553Z #define _SVID_SOURCE 1 2025-05-07T20:27:14.6781643Z #define _LP64 1 2025-05-07T20:27:14.6781734Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:14.6781963Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:14.6782082Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:14.6782166Z #define __UINT8_C(c) c 2025-05-07T20:27:14.6782264Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:14.6782358Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:14.6782468Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:14.6782565Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:14.6782659Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:14.6782757Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:14.6782845Z #define CUDARTAPI 2025-05-07T20:27:14.6782928Z #define IOV_MAX 1024 2025-05-07T20:27:14.6783070Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:14.6783175Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:14.6783268Z #define P_tmpdir "/tmp" 2025-05-07T20:27:14.6783375Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:14.6783462Z #define __wchar_t__ 2025-05-07T20:27:14.6783563Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:14.6783654Z #define SEEK_END 2 2025-05-07T20:27:14.6783746Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:14.6783939Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:14.6784063Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:14.6784203Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:14.6784293Z #define ____FILE_defined 1 2025-05-07T20:27:14.6784410Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:14.6784505Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:14.6784592Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:14.6784690Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:14.6784931Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:14.6785062Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:14.6785145Z #define _IO_RIGHT 04 2025-05-07T20:27:14.6785320Z #define __END_NAMESPACE_STD 2025-05-07T20:27:14.6785506Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:14.6785603Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:14.6785720Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:14.6785816Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:14.6785917Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:14.6785998Z #define _STDDEF_H_ 2025-05-07T20:27:14.6786170Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:14.6786267Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6786385Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:14.6786577Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:14.6786687Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:14.6786827Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:14.6786952Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:14.6787051Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:14.6787163Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:14.6787336Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:14.6787447Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:14.6787549Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:14.6787641Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:14.6787734Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:14.6787905Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:14.6787996Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:14.6788173Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:14.6788271Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:14.6788364Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:14.6788508Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:14.6788603Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:14.6788701Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:14.6788803Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:14.6788918Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:14.6789015Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:14.6789118Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:14.6789279Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:14.6789450Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:14.6789547Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:14.6789665Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:14.6789777Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:14.6789877Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:14.6790098Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:14.6790200Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:14.6790311Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:14.6790410Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:14.6790501Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:14.6790594Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:14.6790697Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:14.6790792Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:14.6790873Z #define __FXSR__ 1 2025-05-07T20:27:14.6790957Z #define _SIZE_T 2025-05-07T20:27:14.6791060Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:14.6791169Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:14.6791335Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:14.6791483Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:14.6791575Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:14.6791675Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:14.6791853Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:14.6792051Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:14.6792225Z #define _GXX_NULLPTR_T 2025-05-07T20:27:14.6792349Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:14.6792439Z #define FOPEN_MAX 16 2025-05-07T20:27:14.6792530Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:14.6792644Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:14.6792745Z #define __suseconds_t_defined 2025-05-07T20:27:14.6792833Z #define __off_t_defined 2025-05-07T20:27:14.6792918Z #define stderr stderr 2025-05-07T20:27:14.6793015Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:14.6793124Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:14.6793219Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:14.6793312Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:14.6793712Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:14.6793809Z #define __mode_t_defined 2025-05-07T20:27:14.6793913Z #define _GCC_SIZE_T 2025-05-07T20:27:14.6794026Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6794144Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:14.6794248Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:14.6794444Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:14.6794537Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:14.6794638Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:14.6794741Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:14.6794848Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:14.6794937Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:14.6795020Z #define __size_t__ 2025-05-07T20:27:14.6795148Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:14.6795242Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:14.6795351Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:14.6795565Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:14.6795658Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:14.6795825Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:14.6795913Z #define _ENDIAN_H 1 2025-05-07T20:27:14.6796016Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:14.6796117Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:14.6796217Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:14.6796299Z #define __try try 2025-05-07T20:27:14.6796394Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:14.6796486Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:14.6796575Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:14.6796827Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:14.6796915Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:14.6796999Z #define __PIC__ 2 2025-05-07T20:27:14.6797107Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:14.6797222Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:14.6797353Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:14.6797449Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:14.6797546Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:14.6797732Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:14.6797832Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:14.6797938Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:14.6798026Z #define _IO_uid_t __uid_t 2025-05-07T20:27:14.6798123Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:14.6798250Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:14.6798341Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:14.6798483Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:14.6798586Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:14.6798704Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:14.6798789Z #define LONG_BIT 64 2025-05-07T20:27:14.6798896Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:14.6798994Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:14.6799123Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:14.6799301Z #define __fsfilcnt_t_defined 2025-05-07T20:27:14.6799394Z #define __blkcnt_t_defined 2025-05-07T20:27:14.6799662Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:14.6799755Z #define __USE_LARGEFILE 1 2025-05-07T20:27:14.6799853Z #define __cpp_constexpr 201603L 2025-05-07T20:27:14.6799948Z #define CUDART_VERSION 12080 2025-05-07T20:27:14.6800036Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:14.6800136Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:14.6800226Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:14.6800418Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:14.6800508Z #define __lldiv_t_defined 1 2025-05-07T20:27:14.6800591Z #define __SSE2__ 1 2025-05-07T20:27:14.6800672Z #define _IOLBF 1 2025-05-07T20:27:14.6800774Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:14.6800867Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:14.6800969Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:14.6801070Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:14.6801178Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:14.6801268Z #define __INT32_TYPE__ int 2025-05-07T20:27:14.6801445Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:14.6801550Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:14.6801649Z #define __cpp_exceptions 199711L 2025-05-07T20:27:14.6801745Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:14.6801852Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:14.6801942Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:14.6802061Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:14.6802218Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:14.6802315Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:14.6802407Z #define __SWORD_TYPE long int 2025-05-07T20:27:14.6802499Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:14.6802595Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:14.6802687Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:14.6802782Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:14.6803061Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:14.6803159Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:14.6803301Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:14.6803382Z #define _T_SIZE 2025-05-07T20:27:14.6803486Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:14.6803611Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:14.6803757Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:14.6803858Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:14.6803967Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:14.6804084Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:14.6804182Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6804275Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:14.6804447Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:14.6804534Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:14.6804642Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:14.6804734Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:14.6804849Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6804938Z #define __PIE__ 2 2025-05-07T20:27:14.6805039Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:14.6805140Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:14.6805326Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:14.6805541Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:14.6805635Z #define __nlink_t_defined 2025-05-07T20:27:14.6805760Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:14.6805868Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:14.6805956Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:14.6806209Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:14.6806409Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:14.6806514Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:14.6806614Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:14.6806714Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:14.6806802Z #define __FILE_defined 1 2025-05-07T20:27:14.6806976Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:14.6807074Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:14.6807167Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:14.6807272Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:14.6807387Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:14.6807498Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:14.6807598Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:14.6807683Z #define __INT16_C(c) c 2025-05-07T20:27:14.6807822Z #define __U32_TYPE unsigned int 2025-05-07T20:27:14.6807952Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:14.6808106Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:14.6811908Z #define __STDC__ 1 2025-05-07T20:27:14.6812029Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:14.6812135Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:14.6812343Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:14.6812499Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:14.6812590Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:14.6812695Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:14.6812793Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:14.6812908Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:14.6813022Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:14.6813121Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:14.6813224Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:14.6813313Z #define stdin stdin 2025-05-07T20:27:14.6813405Z #define __ino64_t_defined 2025-05-07T20:27:14.6813494Z #define STA_CLK 0x8000 
2025-05-07T20:27:14.6813588Z #define __clockid_t_defined 1 2025-05-07T20:27:14.6813745Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:27:14.6813935Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:27:14.6814066Z #define __cudaCDP2MemsetAsync 2025-05-07T20:27:14.6814173Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:27:14.6814276Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:27:14.6814382Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:27:14.6814580Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:27:14.6814673Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:27:14.6815192Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:27:14.6815276Z #define DOMAIN 1 2025-05-07T20:27:14.6815368Z #define M_LN2 0.69314718055994530942 2025-05-07T20:27:14.6815455Z #define __NVCC__ 1 2025-05-07T20:27:14.6815564Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:27:14.6815680Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:14.6815780Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:27:14.6815890Z #define __throw_exception_again throw 2025-05-07T20:27:14.6815987Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:27:14.6816077Z #define __EXCEPTION_H 1 2025-05-07T20:27:14.6816174Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:27:14.6816277Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:27:14.6816575Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:14.6816685Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:27:14.6816786Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:27:14.6816882Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:27:14.6816986Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:27:14.6817081Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:27:14.6817220Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:27:14.6817434Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:14.6817545Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:27:14.6817644Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:27:14.6817749Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:27:14.6817844Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:27:14.6817944Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:27:14.6818081Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:27:14.6818173Z #define __useconds_t_defined 2025-05-07T20:27:14.6818272Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:27:14.6818450Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:27:14.6818593Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:27:14.6818681Z #define __SSE_MATH__ 1 2025-05-07T20:27:14.6818770Z #define _IO_wint_t wint_t 2025-05-07T20:27:14.6818862Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:27:14.6818955Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:27:14.6819053Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:27:14.6819165Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:27:14.6819264Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:27:14.6819440Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:27:14.6819525Z #define __USE_ATFILE 1 2025-05-07T20:27:14.6819626Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:27:14.6819720Z #define 
_POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:27:14.6819809Z #define _GCC_PTRDIFF_T 2025-05-07T20:27:14.6820030Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:14.6820126Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:27:14.6820228Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:27:14.6820328Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:27:14.6820435Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:27:14.6820520Z #define _STDLIB_H 1 2025-05-07T20:27:14.6820655Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:27:14.6820749Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:27:14.6820854Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:27:14.6820981Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:14.6821094Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:14.6821190Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:27:14.6821369Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:27:14.6821524Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:27:14.6821626Z #define __glibcxx_requires_nonempty() 2025-05-07T20:27:14.6821740Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:27:14.6821835Z #define __ldiv_t_defined 1 2025-05-07T20:27:14.6822010Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:27:14.6822102Z #define ___int_ptrdiff_t_h 2025-05-07T20:27:14.6822271Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:14.6822371Z #define __cudaCDP2EventDestroy 2025-05-07T20:27:14.6822461Z #define __HOST_DEFINES_H__ 2025-05-07T20:27:14.6822571Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:27:14.6822671Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:14.6822770Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:27:14.6822857Z #define CUDART_CB 2025-05-07T20:27:14.6822957Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:27:14.6823080Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:27:14.6823165Z #define MB_LEN_MAX 16 2025-05-07T20:27:14.6823383Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:14.6823483Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:27:14.6823604Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:27:14.6823714Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:27:14.6823818Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:27:14.6823965Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:27:14.6824095Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:27:14.6824190Z #define _GNU_SOURCE 1 2025-05-07T20:27:14.6824375Z #define __stub_putmsg 2025-05-07T20:27:14.6824462Z #define __CUDACC__ 1 2025-05-07T20:27:14.6824552Z #define __N(msgid) (msgid) 2025-05-07T20:27:14.6824641Z #define __P(args) args 2025-05-07T20:27:14.6824891Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:27:14.6824992Z #define __cpp_init_captures 201304L 2025-05-07T20:27:14.6825095Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:27:14.6825189Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:27:14.6825287Z #define __cpp_lib_as_const 201510 2025-05-07T20:27:14.6825368Z #define __WCHAR_T 2025-05-07T20:27:14.6825461Z #define __ATOMIC_RELEASE 3 2025-05-07T20:27:14.6825553Z #define __fsblkcnt_t_defined 2025-05-07T20:27:14.6825669Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:27:14.6825771Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 
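A dump like the one summarized above is what you get by asking the preprocessor to print its predefined macros. The exact command used by this job lives in .github/scripts/setup_env.bash, so the following is only a hedged sketch of how such a dump is typically produced, not the CI script's actual invocation:

    # Sketch only; the real command is in .github/scripts/setup_env.bash.
    # Host-compiler (gcc) predefined macros:
    echo | gcc -dM -E -x c++ -
    # Routing through nvcc and passing -dM to the host compiler should also pick up
    # the CUDA-specific macros (__NVCC__, __CUDACC__, CUDART_VERSION, ...); the
    # exact flags here are an assumption and may differ from what the script uses:
    nvcc -E -x cu -Xcompiler -dM /dev/null

Dumping the macros is useful for debugging toolchain mismatches, since values like __CUDA_ARCH_LIST__ and CUDART_VERSION record exactly which CUDA toolkit the build sees.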
2025-05-07T20:27:14.6957832Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:16.5924670Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:16.5925074Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:16.5925663Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:16.5925973Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:16.5926306Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:16.6549536Z /usr/bin/nvidia-smi
2025-05-07T20:27:16.6554521Z + nvidia-smi
2025-05-07T20:27:16.6733840Z Wed May 7 20:27:16 2025
2025-05-07T20:27:16.6734242Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.6734740Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:16.6735238Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:16.6735810Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:16.6736338Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:16.6736783Z |                                         |                        |               MIG M. |
2025-05-07T20:27:16.6737124Z |=========================================+========================+======================|
2025-05-07T20:27:16.6904065Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:16.6904510Z |  0%   26C    P8             16W / 300W  |       0MiB / 23028MiB  |      0%      Default |
2025-05-07T20:27:16.6904902Z |                                         |                        |                  N/A |
2025-05-07T20:27:16.6905293Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:16.6907771Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.6908197Z | Processes:                                                                              |
2025-05-07T20:27:16.6908634Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:16.6909050Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:16.6909390Z |=========================================================================================|
2025-05-07T20:27:16.6911459Z |  No running processes found                                                             |
2025-05-07T20:27:16.6911988Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.9314458Z [INSTALL] Successfully installed CUDA 12.8.0
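The two commands above are the toolkit/driver sanity check for this stage: nvcc reports the installed toolkit release, and nvidia-smi reports the driver and the highest CUDA version it supports. For local debugging this can be scripted; a minimal sketch, where only the build_binary env name is taken from the log and everything else is illustrative:

    # Hypothetical post-install check: toolkit release vs. driver-supported CUDA.
    toolkit=$(conda run -n build_binary nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
    echo "CUDA toolkit release: ${toolkit}, NVIDIA driver: ${driver}"
    # A toolkit newer than what the driver supports would only fail at runtime,
    # which is presumably why this job also sets ENFORCE_CUDA_DEVICE=1.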
2025-05-07T20:27:16.9372264Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.9372899Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.9386667Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:16.9387032Z env:
2025-05-07T20:27:16.9387268Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:16.9387569Z   BUILD_ENV: build_binary
2025-05-07T20:27:16.9387824Z   BUILD_TARGET: genai
2025-05-07T20:27:16.9388053Z   BUILD_VARIANT: cuda
2025-05-07T20:27:16.9388282Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:27:16.9388539Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:16.9388841Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:16.9389179Z ##[endgroup]
2025-05-07T20:27:17.2773521Z ################################################################################
2025-05-07T20:27:17.2773896Z # Install PyTorch (PIP)
2025-05-07T20:27:17.2774137Z #
2025-05-07T20:27:17.2788763Z # [2025-05-07T20:27:17.278Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:27:17.2789217Z ################################################################################
2025-05-07T20:27:17.2817234Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:18.2747806Z Channels:
2025-05-07T20:27:18.2748265Z  - conda-forge
2025-05-07T20:27:18.2748714Z Platform: linux-64
2025-05-07T20:27:21.5295655Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:22.2489187Z Solving environment: done
2025-05-07T20:27:22.4693113Z ## Package Plan ##
2025-05-07T20:27:22.4693501Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:22.4693895Z   added / updated specs:
2025-05-07T20:27:22.4694139Z     - numpy
2025-05-07T20:27:22.4694416Z The following packages will be downloaded:
2025-05-07T20:27:22.4694749Z     package                    |            build
2025-05-07T20:27:22.4695065Z     ---------------------------|-----------------
2025-05-07T20:27:22.4695460Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:27:22.4695917Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:27:22.4696360Z     libgfortran-15.1.0         |       h69a702a_2             34 KB  conda-forge
2025-05-07T20:27:22.4696809Z     libgfortran5-15.1.0        |       hcea5267_2            1.5 MB  conda-forge
2025-05-07T20:27:22.4697264Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:27:22.4697731Z     libopenblas-0.3.29         |pthreads_h94d23a6_0          5.6 MB  conda-forge
2025-05-07T20:27:22.4698171Z     numpy-2.2.5                |   py312h72c5963_0           8.1 MB  conda-forge
2025-05-07T20:27:22.4698556Z     ------------------------------------------------------------
2025-05-07T20:27:22.4698896Z                                            Total:        15.4 MB
2025-05-07T20:27:22.4699232Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:22.4699675Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:22.4700171Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:22.4700667Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:22.4701155Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:22.4701667Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:22.4702199Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:22.4703088Z   numpy              conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
2025-05-07T20:27:22.4703514Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:27:23.5793605Z Preparing transaction: done
2025-05-07T20:27:23.6799086Z Verifying transaction: done
2025-05-07T20:27:23.7807762Z Executing transaction: done
2025-05-07T20:27:23.9540862Z ################################################################################
2025-05-07T20:27:23.9541301Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:23.9541601Z #
2025-05-07T20:27:23.9556497Z # [2025-05-07T20:27:23.955Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:23.9556972Z ################################################################################
2025-05-07T20:27:23.9576275Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:24.0484913Z [CHECK] Network does not appear to be blocked.
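Every "[EXEC] [ATTEMPT 0/3]" line in this log comes from a retry wrapper in the setup scripts, which runs the wrapped command up to three times before failing the step. The real helper lives in .github/scripts/setup_env.bash; the bash below is only a minimal sketch of the pattern the log implies (the function name and backoff are assumptions):

    # Hypothetical sketch of the retry pattern behind the "[EXEC] [ATTEMPT x/3]" lines.
    exec_with_retries () {
      local max_attempts=3
      local attempt
      for attempt in $(seq 0 $((max_attempts - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
        if "$@"; then
          return 0          # command succeeded; stop retrying
        fi
        sleep $((2 ** attempt))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Example usage, matching the network probe in the log:
    #   exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null

Retrying the network probe and the package installs makes the job resilient to transient index or mirror hiccups, which are the most common cause of flaky CI installs.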
2025-05-07T20:27:24.0485419Z ################################################################################
2025-05-07T20:27:24.0485792Z # Prepare PIP Arguments (PyTorch PIP)
2025-05-07T20:27:24.0486073Z #
2025-05-07T20:27:24.0503939Z # [2025-05-07T20:27:24.050Z] + __prepare_pip_arguments torch nightly cuda/12.8.0
2025-05-07T20:27:24.0504390Z ################################################################################
2025-05-07T20:27:24.0528082Z [INSTALL] Extracted package (channel, version): (nightly, LATEST)
2025-05-07T20:27:24.0553303Z [INSTALL] Extracted package variant: cu128
2025-05-07T20:27:24.0569494Z [INSTALL] Using a non-RELEASE channel: nightly ...
2025-05-07T20:27:24.0570038Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:27:24.0578274Z [INSTALL] Extracted the full PIP package: --pre torch
2025-05-07T20:27:24.0586446Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ...
2025-05-07T20:27:24.0607377Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:29:02.0721830Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/
2025-05-07T20:29:02.0722452Z Collecting torch
2025-05-07T20:29:02.0723317Z   Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
2025-05-07T20:29:02.0724656Z Collecting filelock (from torch)
2025-05-07T20:29:02.0725329Z   Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB)
2025-05-07T20:29:02.0726341Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2)
2025-05-07T20:29:02.0727401Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1)
2025-05-07T20:29:02.0728065Z Collecting sympy>=1.13.3 (from torch)
2025-05-07T20:29:02.0728559Z   Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB)
2025-05-07T20:29:02.0729752Z Collecting networkx (from torch)
2025-05-07T20:29:02.0730251Z   Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB)
2025-05-07T20:29:02.0731253Z Collecting jinja2 (from torch)
2025-05-07T20:29:02.0731731Z   Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB)
2025-05-07T20:29:02.0732249Z Collecting fsspec (from torch)
2025-05-07T20:29:02.0732731Z   Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB)
2025-05-07T20:29:02.0733303Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch)
2025-05-07T20:29:02.0741614Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0742489Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch)
2025-05-07T20:29:02.0743323Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0744151Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch)
2025-05-07T20:29:02.0744947Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0745731Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch)
2025-05-07T20:29:02.0746421Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB)
2025-05-07T20:29:02.0747110Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch)
2025-05-07T20:29:02.0748043Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0748748Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch)
2025-05-07T20:29:02.0749524Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:29:02.0750304Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch)
2025-05-07T20:29:02.0751006Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:29:02.0751720Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch)
2025-05-07T20:29:02.0752439Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0753152Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch)
2025-05-07T20:29:02.0753959Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0754760Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch)
2025-05-07T20:29:02.0755478Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
2025-05-07T20:29:02.0756388Z Collecting nvidia-nccl-cu12==2.26.2 (from torch)
2025-05-07T20:29:02.0757150Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
2025-05-07T20:29:02.0757914Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch)
2025-05-07T20:29:02.0758679Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0759458Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch)
2025-05-07T20:29:02.0760307Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
2025-05-07T20:29:02.0761127Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch)
2025-05-07T20:29:02.0761911Z   Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
2025-05-07T20:29:02.0762717Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch)
2025-05-07T20:29:02.0763549Z   Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB)
2025-05-07T20:29:02.0764366Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
2025-05-07T20:29:02.0764918Z   Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB)
2025-05-07T20:29:02.0766637Z Collecting MarkupSafe>=2.0 (from jinja2->torch)
2025-05-07T20:29:02.0767424Z   Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
2025-05-07T20:29:02.0768467Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp312-cp312-manylinux_2_28_x86_64.whl (1047.0 MB)
2025-05-07T20:29:02.0769981Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB)
2025-05-07T20:29:02.0771574Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB)
2025-05-07T20:29:02.0773373Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB)
2025-05-07T20:29:02.0775012Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB)
2025-05-07T20:29:02.0776544Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB)
2025-05-07T20:29:02.0778071Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB)
2025-05-07T20:29:02.0779669Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB)
2025-05-07T20:29:02.0781313Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB)
2025-05-07T20:29:02.0782764Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB)
2025-05-07T20:29:02.0784311Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB)
2025-05-07T20:29:02.0785865Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
2025-05-07T20:29:02.0787375Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
2025-05-07T20:29:02.0789077Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB)
2025-05-07T20:29:02.0790966Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
2025-05-07T20:29:02.0792115Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB)
2025-05-07T20:29:02.0794711Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:29:02.0798336Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128
2025-05-07T20:29:04.3030757Z torch 2.8.0.dev20250507+cu128
2025-05-07T20:29:04.3033543Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128)
2025-05-07T20:29:07.7964541Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:29:11.3195026Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128
2025-05-07T20:29:11.3195470Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:29:14.7747804Z True
2025-05-07T20:29:14.7748053Z True
2025-05-07T20:29:14.8394085Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
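The post-install checks above (correct cu128 variant, torch.distributed importable, C++11 ABI flag) can be reproduced by hand against the same environment. A minimal sketch, assuming the build_binary conda env from the log; these one-liners are illustrative, not the exact commands from setup_env.bash:

    # Hypothetical reproduction of the post-install checks; the exact logic
    # lives in .github/scripts/setup_env.bash.
    conda run -n build_binary python -c "
    import torch, torch.distributed
    # The nightly cu128 wheel encodes the CUDA variant in the version string.
    assert torch.__version__.endswith('+cu128'), torch.__version__
    print(torch.version.cuda)                  # CUDA version torch was built with
    print(torch.compiled_with_cxx11_abi())     # _GLIBCXX_USE_CXX11_ABI setting
    "

Checking the ABI flag here matters because FBGEMM's C++ extensions must be compiled against the same _GLIBCXX_USE_CXX11_ABI value as the installed torch wheel, or the build will fail to link.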
2025-05-07T20:29:14.8433613Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:29:14.8434230Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi
2025-05-07T20:29:14.8448016Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:29:14.8448370Z env:
2025-05-07T20:29:14.8448608Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:29:14.8448912Z   BUILD_ENV: build_binary
2025-05-07T20:29:14.8449166Z   BUILD_TARGET: genai
2025-05-07T20:29:14.8449403Z   BUILD_VARIANT: cuda
2025-05-07T20:29:14.8449653Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:29:14.8449913Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:29:14.8450224Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:29:14.8450567Z ##[endgroup]
2025-05-07T20:29:15.1821931Z /home/ec2-user/miniconda/bin/conda
2025-05-07T20:29:15.1824783Z ################################################################################
2025-05-07T20:29:15.1825923Z # Collect PyTorch Environment Information (for Reporting Issues)
2025-05-07T20:29:15.1826651Z #
2025-05-07T20:29:15.1840091Z # [2025-05-07T20:29:15.183Z] + collect_pytorch_env_info build_binary
2025-05-07T20:29:15.1840589Z ################################################################################
2025-05-07T20:29:15.1855521Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:29:15.2792443Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:29:15.2800825Z [INFO] Downloading the PyTorch environment info collection script ...
2025-05-07T20:29:15.2801670Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
2025-05-07T20:29:15.3696434Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ...
2025-05-07T20:29:15.3721071Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py
2025-05-07T20:29:21.2598401Z Collecting environment information...
2025-05-07T20:29:21.2598831Z PyTorch version: 2.8.0.dev20250507+cu128
2025-05-07T20:29:21.2599130Z Is debug build: False
2025-05-07T20:29:21.2599377Z CUDA used to build PyTorch: 12.8
2025-05-07T20:29:21.2599656Z ROCM used to build PyTorch: N/A
2025-05-07T20:29:21.2599941Z OS: Amazon Linux 2023.6.20250317 (x86_64)
2025-05-07T20:29:21.2600263Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:29:21.2600574Z Clang version: Could not collect
2025-05-07T20:29:21.2600851Z CMake version: Could not collect
2025-05-07T20:29:21.2601121Z Libc version: glibc-2.34
2025-05-07T20:29:21.2601576Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime)
2025-05-07T20:29:21.2602185Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34
2025-05-07T20:29:21.2602987Z Is CUDA available: True
2025-05-07T20:29:21.2603236Z CUDA runtime version: 12.8.61
2025-05-07T20:29:21.2603511Z CUDA_MODULE_LOADING set to: LAZY
2025-05-07T20:29:21.2603823Z GPU models and configuration: GPU 0: NVIDIA A10G
2025-05-07T20:29:21.2604152Z Nvidia driver version: 570.133.07
2025-05-07T20:29:21.2604429Z cuDNN version: Could not collect
2025-05-07T20:29:21.2604700Z HIP runtime version: N/A
2025-05-07T20:29:21.2604955Z MIOpen runtime version: N/A
2025-05-07T20:29:21.2605210Z Is XNNPACK available: True
2025-05-07T20:29:21.2605458Z CPU:
2025-05-07T20:29:21.2605676Z Architecture:                         x86_64
2025-05-07T20:29:21.2606012Z CPU op-mode(s):                       32-bit, 64-bit
2025-05-07T20:29:21.2606414Z Address sizes:                        48 bits physical, 48 bits virtual
2025-05-07T20:29:21.2606806Z Byte Order:                           Little Endian
2025-05-07T20:29:21.2607117Z CPU(s):                               16
2025-05-07T20:29:21.2607423Z On-line CPU(s) list:                  0-15
2025-05-07T20:29:21.2607956Z Vendor ID:                            AuthenticAMD
2025-05-07T20:29:21.2608301Z Model name:                           AMD EPYC 7R32
2025-05-07T20:29:21.2608614Z CPU family:                           23
2025-05-07T20:29:21.2608903Z Model:                                49
2025-05-07T20:29:21.2609194Z Thread(s) per core:                   2
2025-05-07T20:29:21.2609481Z Core(s) per socket:                   8
2025-05-07T20:29:21.2609767Z Socket(s):                            1
2025-05-07T20:29:21.2610052Z Stepping:                             0
2025-05-07T20:29:21.2610345Z BogoMIPS:                             5599.99
2025-05-07T20:29:21.2612425Z Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:29:21.2614513Z Hypervisor vendor:                    KVM
2025-05-07T20:29:21.2614828Z Virtualization type:                  full
2025-05-07T20:29:21.2615167Z L1d cache:                            256 KiB (8 instances)
2025-05-07T20:29:21.2615527Z L1i cache:                            256 KiB (8 instances)
2025-05-07T20:29:21.2615888Z L2 cache:                             4 MiB (8 instances)
2025-05-07T20:29:21.2616246Z L3 cache:                             32 MiB (2 instances)
2025-05-07T20:29:21.2616567Z NUMA node(s):                         1
2025-05-07T20:29:21.2616857Z NUMA node0 CPU(s):                    0-15
2025-05-07T20:29:21.2617191Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:29:21.2617572Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:29:21.2617922Z Vulnerability L1tf:                   Not affected
2025-05-07T20:29:21.2618271Z Vulnerability Mds:                    Not affected
2025-05-07T20:29:21.2618622Z Vulnerability Meltdown:               Not affected
2025-05-07T20:29:21.2618973Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:29:21.2619336Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:29:21.2619873Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:29:21.2620446Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:29:21.2620977Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:29:21.2621657Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:29:21.2622515Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:29:21.2623272Z Vulnerability Srbds:                  Not affected
2025-05-07T20:29:21.2623635Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:29:21.2623972Z Versions of relevant libraries:
2025-05-07T20:29:21.2624238Z [pip3] numpy==2.2.5
2025-05-07T20:29:21.2624478Z [pip3] nvidia-cublas-cu12==12.8.3.14
2025-05-07T20:29:21.2624788Z [pip3] nvidia-cuda-cupti-cu12==12.8.57
2025-05-07T20:29:21.2625101Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61
2025-05-07T20:29:21.2625410Z [pip3] nvidia-cuda-runtime-cu12==12.8.57
2025-05-07T20:29:21.2625727Z [pip3] nvidia-cudnn-cu12==9.8.0.87
2025-05-07T20:29:21.2626021Z [pip3] nvidia-cufft-cu12==11.3.3.41
2025-05-07T20:29:21.2626309Z [pip3] nvidia-curand-cu12==10.3.9.55
2025-05-07T20:29:21.2626614Z [pip3] nvidia-cusolver-cu12==11.7.2.55
2025-05-07T20:29:21.2626918Z [pip3] nvidia-cusparse-cu12==12.5.7.53
2025-05-07T20:29:21.2627342Z [pip3] nvidia-cusparselt-cu12==0.6.3
2025-05-07T20:29:21.2627637Z [pip3] nvidia-nccl-cu12==2.26.2
2025-05-07T20:29:21.2627926Z [pip3] nvidia-nvjitlink-cu12==12.8.61
2025-05-07T20:29:21.2628224Z [pip3] nvidia-nvtx-cu12==12.8.55
2025-05-07T20:29:21.2628507Z [pip3] pytorch-triton==3.3.0+git96316ce5
2025-05-07T20:29:21.2628812Z [pip3] torch==2.8.0.dev20250507+cu128
2025-05-07T20:29:21.2629184Z [conda] cuda-cudart                 12.8.57             h5888daf_1    conda-forge
2025-05-07T20:29:21.2629661Z [conda] cuda-cudart-dev             12.8.57             h5888daf_1    conda-forge
2025-05-07T20:29:21.2630169Z [conda] cuda-cudart-dev_linux-64    12.8.57             h3f2d84a_1    conda-forge
2025-05-07T20:29:21.2630688Z [conda] cuda-cudart-static          12.8.57             h5888daf_1    conda-forge
2025-05-07T20:29:21.2631218Z [conda] cuda-cudart-static_linux-64 12.8.57             h3f2d84a_1    conda-forge
2025-05-07T20:29:21.2631739Z [conda] cuda-cudart_linux-64        12.8.57             h3f2d84a_1    conda-forge
2025-05-07T20:29:21.2632229Z [conda] cuda-cupti                  12.8.57             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2632697Z [conda] cuda-cupti-dev              12.8.57             h5888daf_0    conda-forge
2025-05-07T20:29:21.2633173Z [conda] cuda-libraries              12.8.0              ha770c72_0    conda-forge
2025-05-07T20:29:21.2633671Z [conda] cuda-libraries-dev          12.8.0              ha770c72_0    conda-forge
2025-05-07T20:29:21.2634151Z [conda] cuda-nvrtc                  12.8.61             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2634617Z [conda] cuda-nvrtc-dev              12.8.61             h5888daf_0    conda-forge
2025-05-07T20:29:21.2635069Z [conda] cuda-nvtx                   12.8.55             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2635629Z [conda] cuda-opencl                 12.8.55             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2636109Z [conda] cuda-opencl-dev             12.8.55             h5888daf_0    conda-forge
2025-05-07T20:29:21.2636589Z [conda] cuda-runtime                12.8.0              ha804496_0    conda-forge
2025-05-07T20:29:21.2637051Z [conda] libcublas                   12.8.3.14           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2637519Z [conda] libcublas-dev               12.8.3.14           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2637983Z [conda] libcufft                    11.3.3.41           hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2638435Z [conda] libcufft-dev                11.3.3.41           h5888daf_0    conda-forge
2025-05-07T20:29:21.2638899Z [conda] libcurand                   10.3.9.55           hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2639368Z [conda] libcurand-dev               10.3.9.55           h5888daf_0    conda-forge
2025-05-07T20:29:21.2639842Z [conda] libcusolver                 11.7.2.55           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2640315Z [conda] libcusolver-dev             11.7.2.55           h9ab20c4_0    conda-forge
2025-05-07T20:29:21.2640896Z [conda] libcusparse                 12.5.7.53           hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2641376Z [conda] libcusparse-dev             12.5.7.53           h5888daf_0    conda-forge
2025-05-07T20:29:21.2641855Z [conda] libnvjitlink                12.8.61             hbd13f7d_0    conda-forge
2025-05-07T20:29:21.2642341Z [conda] libnvjitlink-dev            12.8.61             h5888daf_0    conda-forge
2025-05-07T20:29:21.2642804Z [conda] numpy                       2.2.5               py312h72c5963_0 conda-forge
2025-05-07T20:29:21.2643271Z [conda] nvidia-cublas-cu12          12.8.3.14           pypi_0    pypi
2025-05-07T20:29:21.2643762Z [conda] nvidia-cuda-cupti-cu12      12.8.57             pypi_0    pypi
2025-05-07T20:29:21.2644262Z [conda] nvidia-cuda-nvrtc-cu12      12.8.61             pypi_0    pypi
2025-05-07T20:29:21.2644766Z [conda] nvidia-cuda-runtime-cu12    12.8.57             pypi_0    pypi
2025-05-07T20:29:21.2645249Z [conda] nvidia-cudnn-cu12           9.8.0.87            pypi_0    pypi
2025-05-07T20:29:21.2645821Z [conda] nvidia-cufft-cu12           11.3.3.41           pypi_0    pypi
2025-05-07T20:29:21.2646301Z [conda] nvidia-curand-cu12          10.3.9.55           pypi_0    pypi
2025-05-07T20:29:21.2646789Z [conda] nvidia-cusolver-cu12        11.7.2.55           pypi_0    pypi
2025-05-07T20:29:21.2647276Z [conda] nvidia-cusparse-cu12        12.5.7.53           pypi_0    pypi
2025-05-07T20:29:21.2647778Z [conda] nvidia-cusparselt-cu12      0.6.3               pypi_0    pypi
2025-05-07T20:29:21.2648270Z [conda] nvidia-nccl-cu12            2.26.2              pypi_0    pypi
2025-05-07T20:29:21.2648749Z [conda] nvidia-nvjitlink-cu12       12.8.61             pypi_0    pypi
2025-05-07T20:29:21.2649235Z [conda] nvidia-nvtx-cu12            12.8.55             pypi_0    pypi
2025-05-07T20:29:21.2649733Z [conda] pytorch-triton              3.3.0+git96316ce5   pypi_0    pypi
2025-05-07T20:29:21.2650208Z [conda] torch                       2.8.0.dev20250507+cu128 pypi_0    pypi
2025-05-07T20:29:21.3400254Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV
2025-05-07T20:29:21.3400808Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:21.3412635Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:21.3412982Z env: 2025-05-07T20:29:21.3413215Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:21.3413515Z BUILD_ENV: build_binary 2025-05-07T20:29:21.3413766Z BUILD_TARGET: genai 2025-05-07T20:29:21.3413996Z BUILD_VARIANT: cuda 2025-05-07T20:29:21.3414236Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:21.3414491Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:21.3414792Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:21.3415129Z ##[endgroup] 2025-05-07T20:29:21.6803624Z ################################################################################ 2025-05-07T20:29:21.6804020Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:21.6804306Z # 2025-05-07T20:29:21.6818899Z # [2025-05-07T20:29:21.681Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:21.6819311Z ################################################################################ 2025-05-07T20:29:21.6819527Z 2025-05-07T20:29:21.6834147Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:21.7810171Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:21.7831320Z [BUILD] Running git submodules update ... 2025-05-07T20:29:21.7852283Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:21.8209361Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:21.8209990Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:21.8210443Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:21.8210843Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:21.8211247Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:21.8212029Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:21.8212444Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:21.8243646Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:21.8797307Z [BUILD] Installing other build dependencies ... 
2025-05-07T20:29:21.8819115Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:24.2861390Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:24.3032401Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:24.4047417Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:24.4077260Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:24.6069937Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:24.6099450Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:24.7153253Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:24.7177964Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:25.0395017Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:25.0422906Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:25.0901529Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:25.0905255Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:25.1594806Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:25.1619549Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:25.2119814Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:25.2559785Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:25.2585652Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:25.3748109Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:25.3810375Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:25.5023156Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:25.5125141Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:25.5662088Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:25.6273834Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:25.6310350Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:25.7286193Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:25.7312872Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:25.8415616Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:25.8446737Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:25.9460814Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:25.9501087Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:26.0423926Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:26.0455044Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:26.1471275Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:26.1498520Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:26.2513362Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:26.2536359Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:26.3022318Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:26.3540133Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:26.3563445Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:26.4036138Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:26.4518598Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:26.4674713Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:26.5145474Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:26.5778028Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:26.5802719Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:26.6278627Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:26.6761236Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:26.7237662Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:27.2662031Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 51.4 MB/s eta 0:00:00 2025-05-07T20:29:27.2687815Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:27.3160544Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:27.3737778Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:27.4235928Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:27.4745646Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:27.5196128Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:29:27.5817059Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 8.4 MB/s eta 0:00:00 2025-05-07T20:29:27.5845444Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:27.6344224Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:27.6821457Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:27.7308408Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:27.7851200Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:27.8332565Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:27.8912682Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:27.9448869Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:27.9929518Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:28.0341625Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:28.2000233Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:30.4778742Z 2025-05-07T20:29:30.4825513Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:30.6462134Z ################################################################################ 2025-05-07T20:29:30.6462579Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:30.6462934Z # 2025-05-07T20:29:30.6481478Z # [2025-05-07T20:29:30.647Z] + install_triton_pip build_binary 2025-05-07T20:29:30.6481856Z ################################################################################ 2025-05-07T20:29:30.6482071Z 2025-05-07T20:29:30.6482290Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:30.6482753Z ################################################################################ 2025-05-07T20:29:30.6483215Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:30.6483538Z # 2025-05-07T20:29:30.6499232Z # [2025-05-07T20:29:30.649Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:30.6499797Z ################################################################################ 2025-05-07T20:29:30.6500009Z 2025-05-07T20:29:30.6515534Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:30.7413072Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:30.7413547Z ################################################################################ 2025-05-07T20:29:30.7413890Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:30.7414167Z # 2025-05-07T20:29:30.7431295Z # [2025-05-07T20:29:30.742Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:30.7431862Z ################################################################################ 2025-05-07T20:29:30.7432083Z 2025-05-07T20:29:30.7478951Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:30.7496237Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:30.7496980Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:30.7505353Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:30.7515200Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:30.7537462Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:38.4429363Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:38.4430592Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:38.4431249Z 2025-05-07T20:29:38.4431471Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:38.4431880Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:38.4432670Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:38.4433876Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:38.4434945Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 56.1 MB/s eta 0:00:00 2025-05-07T20:29:38.4435329Z Installing collected packages: pytorch-triton 2025-05-07T20:29:38.4435785Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:38.4436169Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:38.4436583Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:38.4437378Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:38.4437814Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:38.4438069Z 2025-05-07T20:29:40.6484993Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:40.6488427Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:42.7953129Z ################################################################################ 2025-05-07T20:29:42.7953729Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:42.7954252Z ################################################################################ 2025-05-07T20:29:42.7954558Z 2025-05-07T20:29:44.8317417Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:46.9948276Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:46.9952622Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:46.9984919Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:46.9985409Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:46.9998073Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:46.9998426Z env: 2025-05-07T20:29:46.9998653Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:46.9998953Z BUILD_ENV: build_binary 2025-05-07T20:29:46.9999207Z BUILD_TARGET: genai 2025-05-07T20:29:46.9999441Z BUILD_VARIANT: cuda 2025-05-07T20:29:46.9999723Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:46.9999983Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:47.0000284Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:47.0000621Z ##[endgroup] 2025-05-07T20:29:47.3355282Z ################################################################################ 2025-05-07T20:29:47.3355767Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:47.3356118Z # 2025-05-07T20:29:47.3371740Z # [2025-05-07T20:29:47.336Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3372615Z ################################################################################ 2025-05-07T20:29:47.3372835Z 2025-05-07T20:29:47.3373193Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3373884Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3374226Z 2025-05-07T20:29:47.3536564Z c73a702bbc09a0f1f522be4fc10889dc19360f75 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3538084Z 2025-05-07T20:29:47.3538773Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3539137Z 2025-05-07T20:29:47.3726716Z 3a160ecc54665559cce7e57cc15438640cf521df66903a79480f30a5b3cf6942 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3729189Z 2025-05-07T20:29:47.3729687Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.3730026Z 2025-05-07T20:29:47.4061663Z e7438d9eb3f38b23c683d9c8a7a66fd4 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:47.4063338Z 2025-05-07T20:29:47.4073481Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:47.4094815Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:50.1846593Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:50.1847551Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:50.1848407Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:50.1849199Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:50.1849468Z 2025-05-07T20:29:57.1939515Z ################################################################################ 2025-05-07T20:29:57.1940427Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:57.1941482Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:57.1942515Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:57.1943140Z [CHECK] 2025-05-07T20:29:57.1943783Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:57.1944766Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:57.1945544Z ################################################################################ 2025-05-07T20:29:57.1945981Z 2025-05-07T20:29:57.1946212Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:30:01.2029747Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:05.1931363Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:30:09.1852816Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:30:09.1856190Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:30:21.1166966Z ################################################################################ 2025-05-07T20:30:21.1167403Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:30:21.1167762Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:30:21.1168118Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:30:21.1168473Z ################################################################################ 2025-05-07T20:30:21.1168690Z 2025-05-07T20:30:29.1024659Z ################################################################################ 2025-05-07T20:30:29.1025173Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:30:29.1026591Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:30:29.1028169Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:30:29.1028707Z ################################################################################ 2025-05-07T20:30:29.1028932Z 2025-05-07T20:30:29.1029089Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:33.0973832Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:37.0906529Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:41.1916830Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:45.2078227Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:45.2081671Z [INSTALL] Check for operator registrations ... 
2025-05-07T20:30:49.1118335Z fbgemm.nccl_init 2025-05-07T20:30:49.1118520Z 2025-05-07T20:30:49.1736971Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:53.0670027Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:53.0670236Z 2025-05-07T20:30:53.1279533Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:57.0333660Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:57.0333863Z 2025-05-07T20:30:57.0953054Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:57.0953635Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:57.0989970Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:57.0990434Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:57.1003898Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:57.1004276Z env: 2025-05-07T20:30:57.1004595Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:57.1004916Z BUILD_ENV: build_binary 2025-05-07T20:30:57.1005178Z BUILD_TARGET: genai 2025-05-07T20:30:57.1005415Z BUILD_VARIANT: cuda 2025-05-07T20:30:57.1005652Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:57.1005919Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:57.1006229Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:57.1006561Z ##[endgroup] 2025-05-07T20:30:57.4369232Z ################################################################################ 2025-05-07T20:30:57.4369671Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:57.4369935Z # 2025-05-07T20:30:57.4384559Z # [2025-05-07T20:30:57.438Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:57.4385120Z ################################################################################ 2025-05-07T20:30:57.4385416Z 2025-05-07T20:31:05.4310524Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:31:05.4311100Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:31:05.4311496Z [TEST] Determined the test directories: 2025-05-07T20:31:05.4311811Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:31:05.4312108Z fbgemm_gpu/experimental/example/test 2025-05-07T20:31:05.4312408Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:31:05.4312592Z 2025-05-07T20:31:05.4318786Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:31:05.4325416Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:31:05.4325850Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:31:05.4326140Z 2025-05-07T20:31:05.8538982Z 2025-05-07T20:31:05.8539462Z [TEST] Installing PyTest ... 
2025-05-07T20:31:05.8564647Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:31:06.9649559Z Channels: 2025-05-07T20:31:06.9649879Z - conda-forge 2025-05-07T20:31:06.9650192Z Platform: linux-64 2025-05-07T20:31:10.2443358Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:31:11.4010471Z Solving environment: \ | / done 2025-05-07T20:31:11.6315595Z 2025-05-07T20:31:11.6315872Z ## Package Plan ## 2025-05-07T20:31:11.6316105Z 2025-05-07T20:31:11.6316397Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:31:11.6316746Z 2025-05-07T20:31:11.6316851Z added / updated specs: 2025-05-07T20:31:11.6317094Z - expecttest 2025-05-07T20:31:11.6317313Z - pytest 2025-05-07T20:31:11.6317435Z 2025-05-07T20:31:11.6317439Z 2025-05-07T20:31:11.6317563Z The following packages will be downloaded: 2025-05-07T20:31:11.6317828Z 2025-05-07T20:31:11.6317991Z package | build 2025-05-07T20:31:11.6318447Z ---------------------------|----------------- 2025-05-07T20:31:11.6318886Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:31:11.6319532Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:31:11.6320167Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:31:11.6320740Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:31:11.6321176Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:31:11.6321593Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:31:11.6322006Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:31:11.6322698Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:31:11.6323094Z ------------------------------------------------------------ 2025-05-07T20:31:11.6323427Z Total: 428 KB 2025-05-07T20:31:11.6323787Z 2025-05-07T20:31:11.6323912Z The following NEW packages will be INSTALLED: 2025-05-07T20:31:11.6324128Z 2025-05-07T20:31:11.6324331Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:31:11.6324838Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:31:11.6325347Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:31:11.6325831Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:31:11.6326291Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:31:11.6326739Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:31:11.6327167Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:31:11.6327588Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:31:11.6327842Z 2025-05-07T20:31:11.6327846Z 2025-05-07T20:31:11.6327850Z 2025-05-07T20:31:11.6328003Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:31:11.6328368Z pytest-8.3.5 | 254 KB | | 0% 2025-05-07T20:31:11.6328591Z 2025-05-07T20:31:11.6328858Z packaging-25.0 | 61 KB | | 0%  2025-05-07T20:31:11.6329098Z 2025-05-07T20:31:11.6329107Z 2025-05-07T20:31:11.6342773Z colorama-0.4.6 | 26 KB | | 0%  2025-05-07T20:31:11.6343030Z 2025-05-07T20:31:11.6343034Z 2025-05-07T20:31:11.6343044Z 2025-05-07T20:31:11.6355933Z pluggy-1.5.0 | 23 KB | | 0%  2025-05-07T20:31:11.6356194Z 2025-05-07T20:31:11.6356205Z 2025-05-07T20:31:11.6356209Z 2025-05-07T20:31:11.6357753Z 2025-05-07T20:31:11.6367779Z exceptiongroup-1.2.2 | 20 KB | | 0%  2025-05-07T20:31:11.6368073Z 2025-05-07T20:31:11.6368078Z 2025-05-07T20:31:11.6368088Z 2025-05-07T20:31:11.6368093Z 2025-05-07T20:31:11.6368103Z 2025-05-07T20:31:11.6369085Z tomli-2.2.1 | 19 KB | | 0%  2025-05-07T20:31:11.6369360Z 2025-05-07T20:31:11.6369365Z 2025-05-07T20:31:11.6369368Z 2025-05-07T20:31:11.6369386Z 2025-05-07T20:31:11.6369389Z 2025-05-07T20:31:11.6369393Z 2025-05-07T20:31:11.6372824Z expecttest-0.3.0 | 14 KB | | 0%  2025-05-07T20:31:11.6373109Z 2025-05-07T20:31:11.6373120Z 2025-05-07T20:31:11.6373124Z 2025-05-07T20:31:11.6373128Z 2025-05-07T20:31:11.6373132Z 2025-05-07T20:31:11.6373135Z 2025-05-07T20:31:11.6376088Z 2025-05-07T20:31:11.7088906Z iniconfig-2.0.0 | 11 KB | | 0%  2025-05-07T20:31:11.7089219Z 2025-05-07T20:31:11.7089223Z 2025-05-07T20:31:11.7089894Z 2025-05-07T20:31:11.7870377Z pluggy-1.5.0 | 23 KB | ########## | 100%  2025-05-07T20:31:11.7870757Z 2025-05-07T20:31:11.7870763Z 2025-05-07T20:31:11.7870766Z 2025-05-07T20:31:11.7871127Z 2025-05-07T20:31:11.7883033Z exceptiongroup-1.2.2 | 20 KB | #######9 | 80%  2025-05-07T20:31:11.7883383Z 2025-05-07T20:31:11.7883387Z 2025-05-07T20:31:11.7883391Z 2025-05-07T20:31:11.7883394Z 2025-05-07T20:31:11.7883398Z 2025-05-07T20:31:11.7921512Z tomli-2.2.1 | 19 KB | ########5 | 85%  2025-05-07T20:31:11.7921788Z 2025-05-07T20:31:11.7921792Z 2025-05-07T20:31:11.7921795Z 2025-05-07T20:31:11.7927896Z 2025-05-07T20:31:11.7961822Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%  2025-05-07T20:31:11.7962196Z 2025-05-07T20:31:11.7962202Z 2025-05-07T20:31:11.7962207Z 2025-05-07T20:31:11.7962212Z 2025-05-07T20:31:11.7965261Z 2025-05-07T20:31:11.8719761Z tomli-2.2.1 | 19 KB | ########## | 100%  2025-05-07T20:31:11.8720112Z 2025-05-07T20:31:11.8720116Z 2025-05-07T20:31:11.8720120Z 2025-05-07T20:31:11.8720418Z 2025-05-07T20:31:11.8720425Z 2025-05-07T20:31:11.8730668Z 2025-05-07T20:31:11.8740682Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:31:11.8741178Z 2025-05-07T20:31:11.8741182Z 2025-05-07T20:31:11.8741186Z 2025-05-07T20:31:11.8741189Z 2025-05-07T20:31:11.8741196Z 2025-05-07T20:31:11.8741200Z 2025-05-07T20:31:11.8742491Z 2025-05-07T20:31:11.8775465Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:31:11.8775754Z 2025-05-07T20:31:11.8775758Z 2025-05-07T20:31:11.8775761Z 2025-05-07T20:31:11.8775765Z 2025-05-07T20:31:11.8775769Z 2025-05-07T20:31:11.8775776Z 2025-05-07T20:31:11.8783786Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:31:11.8784180Z 2025-05-07T20:31:11.8784186Z 2025-05-07T20:31:11.8784191Z 2025-05-07T20:31:11.8784197Z 2025-05-07T20:31:11.8784211Z 2025-05-07T20:31:11.8784217Z 2025-05-07T20:31:11.8784222Z 2025-05-07T20:31:11.8786369Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:31:11.8786738Z 2025-05-07T20:31:11.8789049Z 2025-05-07T20:31:11.9061737Z colorama-0.4.6 | 26 KB | ###### | 61%  2025-05-07T20:31:11.9062400Z 2025-05-07T20:31:11.9113372Z 2025-05-07T20:31:11.9621830Z colorama-0.4.6 | 26 KB 
| ########## | 100%  2025-05-07T20:31:11.9622177Z 2025-05-07T20:31:11.9622183Z 2025-05-07T20:31:11.9622723Z 2025-05-07T20:31:11.9631859Z pluggy-1.5.0 | 23 KB | ########## | 100%  2025-05-07T20:31:11.9632203Z 2025-05-07T20:31:11.9632209Z 2025-05-07T20:31:11.9632746Z 2025-05-07T20:31:11.9703497Z pluggy-1.5.0 | 23 KB | ########## | 100%  2025-05-07T20:31:11.9703858Z 2025-05-07T20:31:11.9703865Z 2025-05-07T20:31:11.9703870Z 2025-05-07T20:31:11.9703885Z 2025-05-07T20:31:11.9703891Z 2025-05-07T20:31:11.9718488Z tomli-2.2.1 | 19 KB | ########## | 100%  2025-05-07T20:31:11.9718838Z 2025-05-07T20:31:11.9718856Z 2025-05-07T20:31:11.9718862Z 2025-05-07T20:31:11.9718875Z 2025-05-07T20:31:11.9805173Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%  2025-05-07T20:31:11.9997470Z pytest-8.3.5 | 254 KB | 6 | 6% 2025-05-07T20:31:11.9997812Z 2025-05-07T20:31:11.9997818Z 2025-05-07T20:31:11.9997823Z 2025-05-07T20:31:11.9997828Z 2025-05-07T20:31:11.9997833Z 2025-05-07T20:31:11.9997838Z 2025-05-07T20:31:12.0016406Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:31:12.0087066Z pytest-8.3.5 | 254 KB | ########## | 100% 2025-05-07T20:31:12.0087413Z 2025-05-07T20:31:12.0087419Z 2025-05-07T20:31:12.0087427Z 2025-05-07T20:31:12.0087605Z 2025-05-07T20:31:12.0087612Z 2025-05-07T20:31:12.0087617Z 2025-05-07T20:31:12.0087626Z 2025-05-07T20:31:12.0161055Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:31:12.0161433Z 2025-05-07T20:31:12.0162497Z 2025-05-07T20:31:12.0165188Z colorama-0.4.6 | 26 KB | ########## | 100%  2025-05-07T20:31:12.0165694Z 2025-05-07T20:31:12.0165700Z 2025-05-07T20:31:12.0250016Z colorama-0.4.6 | 26 KB | ########## | 100%  2025-05-07T20:31:12.0250629Z 2025-05-07T20:31:12.0274241Z packaging-25.0 | 61 KB | ##6 | 26%  2025-05-07T20:31:12.0274579Z 2025-05-07T20:31:12.0435866Z packaging-25.0 | 61 KB | ########## | 100%  2025-05-07T20:31:12.0436534Z 2025-05-07T20:31:12.0445094Z packaging-25.0 | 61 KB | ########## | 100%  2025-05-07T20:31:12.0451082Z pytest-8.3.5 | 254 KB | ########## | 100% 2025-05-07T20:31:12.0451553Z 2025-05-07T20:31:12.0451858Z 2025-05-07T20:31:12.0452122Z  2025-05-07T20:31:12.0452392Z 2025-05-07T20:31:12.0452397Z 2025-05-07T20:31:12.0452618Z  2025-05-07T20:31:12.0452907Z 2025-05-07T20:31:12.0453176Z 2025-05-07T20:31:12.0453183Z 2025-05-07T20:31:12.0453426Z  2025-05-07T20:31:12.0453716Z 2025-05-07T20:31:12.0453721Z 2025-05-07T20:31:12.0453892Z 2025-05-07T20:31:12.0453896Z 2025-05-07T20:31:12.0454081Z  2025-05-07T20:31:12.0454292Z 2025-05-07T20:31:12.0454295Z 2025-05-07T20:31:12.0454299Z 2025-05-07T20:31:12.0454302Z 2025-05-07T20:31:12.0454306Z 2025-05-07T20:31:12.0454479Z  2025-05-07T20:31:12.0454690Z 2025-05-07T20:31:12.0454693Z 2025-05-07T20:31:12.0454697Z 2025-05-07T20:31:12.0454700Z 2025-05-07T20:31:12.0454704Z 2025-05-07T20:31:12.0454707Z 2025-05-07T20:31:12.0454882Z  2025-05-07T20:31:12.0455091Z 2025-05-07T20:31:12.0455100Z 2025-05-07T20:31:12.0455111Z 2025-05-07T20:31:12.0455115Z 2025-05-07T20:31:12.0455126Z 2025-05-07T20:31:12.0455130Z 2025-05-07T20:31:12.0455133Z 2025-05-07T20:31:12.0455318Z  done 2025-05-07T20:31:12.1457461Z Preparing transaction: \ done 2025-05-07T20:31:12.2462931Z Verifying transaction: / done 2025-05-07T20:31:14.1490946Z Executing transaction: \ | / - \ | / - \ | / - \ | / - \ | / done 2025-05-07T20:31:14.2763428Z [TEST] Checking imports ... 2025-05-07T20:31:18.2560658Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:31:18.2573366Z [TEST] Setting feature flags ... 
2025-05-07T20:31:18.2573899Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:31:18.2574301Z 2025-05-07T20:31:18.6788845Z 2025-05-07T20:31:18.6789538Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:31:18.6790687Z ################################################################################ 2025-05-07T20:31:18.6791133Z # Run FBGEMM-GPU Tests: 2025-05-07T20:31:18.6791423Z # 2025-05-07T20:31:18.6810447Z # [2025-05-07T20:31:18.680Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:31:18.6810984Z ################################################################################ 2025-05-07T20:31:18.6811231Z 2025-05-07T20:31:18.6820421Z [TEST] Enumerating ALL test files ... 2025-05-07T20:31:18.6849075Z ./attention/gqa_test.py 2025-05-07T20:31:18.6849351Z ./coalesce/coalesce_test.py 2025-05-07T20:31:18.6849621Z ./comm/multi_gpu_car_test.py 2025-05-07T20:31:18.6849896Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:18.6850191Z ./kv_cache/kv_cache_test.py 2025-05-07T20:31:18.6850437Z ./moe/activation_test.py 2025-05-07T20:31:18.6850691Z ./moe/gather_scatter_test.py 2025-05-07T20:31:18.6850941Z ./moe/layers_test.py 2025-05-07T20:31:18.6851164Z ./moe/shuffling_test.py 2025-05-07T20:31:18.6851406Z ./quantize/quantize_test.py 2025-05-07T20:31:18.6851568Z 2025-05-07T20:31:18.6851700Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:31:18.6851908Z 2025-05-07T20:31:18.6869766Z ################################################################################ 2025-05-07T20:31:18.6885050Z # [2025-05-07T20:31:18.688Z] Run Python Test Suite: 2025-05-07T20:31:18.6885448Z # ./attention/gqa_test.py 2025-05-07T20:31:18.6885821Z ################################################################################ 2025-05-07T20:31:18.6909011Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:31:18.6909619Z 2025-05-07T20:31:21.2393048Z ============================= test session starts ============================== 2025-05-07T20:31:21.2394603Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:21.2395218Z cachedir: .pytest_cache 2025-05-07T20:31:21.2396194Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:21.2396930Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:21.2397481Z plugins: hypothesis-6.131.14 2025-05-07T20:31:22.9177081Z collecting ... 
collected 2 items 2025-05-07T20:31:22.9177364Z 2025-05-07T20:31:58.4717320Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:58.4720005Z self=, 2025-05-07T20:31:58.4720685Z int4_kv=False, 2025-05-07T20:31:58.4720986Z num_groups=1, 2025-05-07T20:31:58.4721238Z B=1, 2025-05-07T20:31:58.4721469Z MAX_T=4, 2025-05-07T20:31:58.4721708Z N_H_L=1, 2025-05-07T20:31:58.4726011Z ) 2025-05-07T20:31:58.4726323Z Trying example: test_gqa( 2025-05-07T20:31:58.4726716Z self=, 2025-05-07T20:31:58.4727104Z int4_kv=True, 2025-05-07T20:31:58.4727366Z num_groups=1, 2025-05-07T20:31:58.4727678Z B=1, 2025-05-07T20:31:58.4727905Z MAX_T=4, 2025-05-07T20:31:58.4728144Z N_H_L=1, 2025-05-07T20:31:58.4728380Z ) 2025-05-07T20:31:58.4728617Z Trying example: test_gqa( 2025-05-07T20:31:58.4728995Z self=, 2025-05-07T20:31:58.4729382Z int4_kv=True, 2025-05-07T20:31:58.4729639Z num_groups=4, 2025-05-07T20:31:58.4729887Z B=23, 2025-05-07T20:31:58.4730117Z MAX_T=33, 2025-05-07T20:31:58.4730358Z N_H_L=68, 2025-05-07T20:31:58.4730587Z ) 2025-05-07T20:31:58.4730834Z Trying example: test_gqa( 2025-05-07T20:31:58.4731188Z self=, 2025-05-07T20:31:58.4731561Z int4_kv=True, 2025-05-07T20:31:58.4731815Z num_groups=4, 2025-05-07T20:31:58.4732071Z B=77, 2025-05-07T20:31:58.4732292Z MAX_T=4, 2025-05-07T20:31:58.4732545Z N_H_L=1, 2025-05-07T20:31:58.4732779Z ) 2025-05-07T20:31:58.4733009Z Trying example: test_gqa( 2025-05-07T20:31:58.4733369Z self=, 2025-05-07T20:31:58.4733749Z int4_kv=True, 2025-05-07T20:31:58.4734002Z num_groups=4, 2025-05-07T20:31:58.4734253Z B=77, 2025-05-07T20:31:58.4734490Z MAX_T=52, 2025-05-07T20:31:58.4734730Z N_H_L=67, 2025-05-07T20:31:58.4734963Z ) 2025-05-07T20:31:58.4735198Z Trying example: test_gqa( 2025-05-07T20:31:58.4735544Z self=, 2025-05-07T20:31:58.4735932Z int4_kv=False, 2025-05-07T20:31:58.4736191Z num_groups=4, 2025-05-07T20:31:58.4736440Z B=57, 2025-05-07T20:31:58.4736670Z MAX_T=45, 2025-05-07T20:31:58.4736912Z N_H_L=120, 2025-05-07T20:31:58.4737145Z ) 2025-05-07T20:31:58.4737384Z Trying example: test_gqa( 2025-05-07T20:31:58.4737746Z self=, 2025-05-07T20:31:58.4738129Z int4_kv=True, 2025-05-07T20:31:58.4738384Z num_groups=4, 2025-05-07T20:31:58.4738629Z B=52, 2025-05-07T20:31:58.4738859Z MAX_T=42, 2025-05-07T20:31:58.4739099Z N_H_L=53, 2025-05-07T20:31:58.4739335Z ) 2025-05-07T20:31:58.4739572Z Trying example: test_gqa( 2025-05-07T20:31:58.4739921Z self=, 2025-05-07T20:31:58.4740296Z int4_kv=True, 2025-05-07T20:31:58.4740556Z num_groups=1, 2025-05-07T20:31:58.4740805Z B=77, 2025-05-07T20:31:58.4741028Z MAX_T=95, 2025-05-07T20:31:58.4741265Z N_H_L=53, 2025-05-07T20:31:58.4741564Z ) 2025-05-07T20:31:58.4742016Z Trying example: test_gqa( 2025-05-07T20:31:58.4742478Z self=, 2025-05-07T20:31:58.4742938Z int4_kv=True, 2025-05-07T20:31:58.4743325Z num_groups=4, 2025-05-07T20:31:58.4743679Z B=113, 2025-05-07T20:31:58.4743990Z MAX_T=48, 2025-05-07T20:31:58.4754518Z N_H_L=96, 2025-05-07T20:31:58.4754902Z ) 2025-05-07T20:31:58.4755178Z Trying example: test_gqa( 2025-05-07T20:31:58.4755570Z self=, 2025-05-07T20:31:58.4756068Z int4_kv=False, 2025-05-07T20:31:58.4756344Z num_groups=1, 2025-05-07T20:31:58.4756974Z B=51, 2025-05-07T20:31:58.4757222Z MAX_T=61, 2025-05-07T20:31:58.4757470Z N_H_L=69, 2025-05-07T20:31:58.4757708Z ) 2025-05-07T20:31:58.4757954Z Trying example: test_gqa( 2025-05-07T20:31:58.4758555Z self=, 2025-05-07T20:31:58.4758938Z int4_kv=False, 2025-05-07T20:31:58.4759202Z num_groups=4, 2025-05-07T20:31:58.4759461Z B=17, 2025-05-07T20:31:58.4759695Z MAX_T=113, 
2025-05-07T20:31:58.4759940Z N_H_L=65, 2025-05-07T20:31:58.4760181Z ) 2025-05-07T20:31:58.4760423Z Trying example: test_gqa( 2025-05-07T20:31:58.4760771Z self=, 2025-05-07T20:31:58.4761159Z int4_kv=False, 2025-05-07T20:31:58.4761420Z num_groups=4, 2025-05-07T20:31:58.4761667Z B=17, 2025-05-07T20:31:58.4761899Z MAX_T=65, 2025-05-07T20:31:58.4762141Z N_H_L=65, 2025-05-07T20:31:58.4762376Z ) 2025-05-07T20:31:58.4762616Z Trying example: test_gqa( 2025-05-07T20:31:58.4762981Z self=, 2025-05-07T20:31:58.4763359Z int4_kv=False, 2025-05-07T20:31:58.4763620Z num_groups=4, 2025-05-07T20:31:58.4763878Z B=65, 2025-05-07T20:31:58.4764113Z MAX_T=65, 2025-05-07T20:31:58.4764357Z N_H_L=65, 2025-05-07T20:31:58.4764595Z ) 2025-05-07T20:31:58.4764852Z Trying example: test_gqa( 2025-05-07T20:31:58.4765229Z self=, 2025-05-07T20:31:58.4765927Z int4_kv=False, 2025-05-07T20:31:58.4766183Z num_groups=1, 2025-05-07T20:31:58.4766468Z B=6, 2025-05-07T20:31:58.4766722Z MAX_T=108, 2025-05-07T20:31:58.4766978Z N_H_L=14, 2025-05-07T20:31:58.4767232Z ) 2025-05-07T20:31:58.4767484Z Trying example: test_gqa( 2025-05-07T20:31:58.4767814Z self=, 2025-05-07T20:31:58.4768137Z int4_kv=False, 2025-05-07T20:31:58.4768355Z num_groups=1, 2025-05-07T20:31:58.4768563Z B=6, 2025-05-07T20:31:58.4768751Z MAX_T=14, 2025-05-07T20:31:58.4768963Z N_H_L=14, 2025-05-07T20:31:58.4769162Z ) 2025-05-07T20:31:58.4769351Z Trying example: test_gqa( 2025-05-07T20:31:58.4769648Z self=, 2025-05-07T20:31:58.4769969Z int4_kv=False, 2025-05-07T20:31:58.4770180Z num_groups=1, 2025-05-07T20:31:58.4770394Z B=6, 2025-05-07T20:31:58.4770588Z MAX_T=6, 2025-05-07T20:31:58.4770782Z N_H_L=14, 2025-05-07T20:31:58.4770979Z ) 2025-05-07T20:31:58.4771175Z Trying example: test_gqa( 2025-05-07T20:31:58.4771465Z self=, 2025-05-07T20:31:58.4771781Z int4_kv=False, 2025-05-07T20:31:58.4771997Z num_groups=1, 2025-05-07T20:31:58.4772201Z B=6, 2025-05-07T20:31:58.4772392Z MAX_T=6, 2025-05-07T20:31:58.4772589Z N_H_L=6, 2025-05-07T20:31:58.4772777Z ) 2025-05-07T20:31:58.4772979Z Trying example: test_gqa( 2025-05-07T20:31:58.4773270Z self=, 2025-05-07T20:31:58.4773579Z int4_kv=False, 2025-05-07T20:31:58.4773796Z num_groups=1, 2025-05-07T20:31:58.4774006Z B=70, 2025-05-07T20:31:58.4774190Z MAX_T=94, 2025-05-07T20:31:58.4774389Z N_H_L=78, 2025-05-07T20:31:58.4774584Z ) 2025-05-07T20:31:58.4774779Z Trying example: test_gqa( 2025-05-07T20:31:58.4775069Z self=, 2025-05-07T20:31:58.4775386Z int4_kv=False, 2025-05-07T20:31:58.4775592Z num_groups=1, 2025-05-07T20:31:58.4775803Z B=78, 2025-05-07T20:31:58.4775991Z MAX_T=94, 2025-05-07T20:31:58.4776190Z N_H_L=78, 2025-05-07T20:31:58.4776381Z ) 2025-05-07T20:31:58.4776578Z Trying example: test_gqa( 2025-05-07T20:31:58.4776866Z self=, 2025-05-07T20:31:58.4777179Z int4_kv=False, 2025-05-07T20:31:58.4777393Z num_groups=1, 2025-05-07T20:31:58.4777600Z B=94, 2025-05-07T20:31:58.4777782Z MAX_T=94, 2025-05-07T20:31:58.4777980Z N_H_L=78, 2025-05-07T20:31:58.4778173Z ) 2025-05-07T20:31:58.4778358Z Trying example: test_gqa( 2025-05-07T20:31:58.4778800Z self=, 2025-05-07T20:31:58.4779119Z int4_kv=False, 2025-05-07T20:31:58.4779325Z num_groups=1, 2025-05-07T20:31:58.4779532Z B=94, 2025-05-07T20:31:58.4779840Z MAX_T=94, 2025-05-07T20:31:58.4780032Z N_H_L=94, 2025-05-07T20:31:58.4780227Z ) 2025-05-07T20:31:58.4780423Z Trying example: test_gqa( 2025-05-07T20:31:58.4780710Z self=, 2025-05-07T20:31:58.4781028Z int4_kv=False, 2025-05-07T20:31:58.4781243Z num_groups=4, 2025-05-07T20:31:58.4781445Z B=41, 2025-05-07T20:31:58.4781637Z MAX_T=105, 
2025-05-07T20:31:58.4781842Z N_H_L=126, 2025-05-07T20:31:58.4782035Z ) 2025-05-07T20:31:58.4782231Z Trying example: test_gqa( 2025-05-07T20:31:58.4782523Z self=, 2025-05-07T20:31:58.4782830Z int4_kv=False, 2025-05-07T20:31:58.4783041Z num_groups=4, 2025-05-07T20:31:58.4783255Z B=105, 2025-05-07T20:31:58.4783454Z MAX_T=105, 2025-05-07T20:31:58.4783655Z N_H_L=126, 2025-05-07T20:31:58.4783857Z ) 2025-05-07T20:31:58.4784055Z Trying example: test_gqa( 2025-05-07T20:31:58.4784338Z self=, 2025-05-07T20:31:58.4784661Z int4_kv=False, 2025-05-07T20:31:58.4784865Z num_groups=4, 2025-05-07T20:31:58.4785070Z B=105, 2025-05-07T20:31:58.4785262Z MAX_T=105, 2025-05-07T20:31:58.4785463Z N_H_L=105, 2025-05-07T20:31:58.4785652Z ) 2025-05-07T20:31:58.4785844Z Trying example: test_gqa( 2025-05-07T20:31:58.4786134Z self=, 2025-05-07T20:31:58.4786438Z int4_kv=True, 2025-05-07T20:31:58.4786643Z num_groups=1, 2025-05-07T20:31:58.4786848Z B=95, 2025-05-07T20:31:58.4787033Z MAX_T=114, 2025-05-07T20:31:58.4787231Z N_H_L=43, 2025-05-07T20:31:58.4787424Z ) 2025-05-07T20:31:58.4787611Z Trying example: test_gqa( 2025-05-07T20:31:58.4787899Z self=, 2025-05-07T20:31:58.4788215Z int4_kv=True, 2025-05-07T20:31:58.4788424Z num_groups=1, 2025-05-07T20:31:58.4788625Z B=43, 2025-05-07T20:31:58.4788813Z MAX_T=114, 2025-05-07T20:31:58.4789013Z N_H_L=43, 2025-05-07T20:31:58.4789204Z ) 2025-05-07T20:31:58.4789398Z Trying example: test_gqa( 2025-05-07T20:31:58.4789687Z self=, 2025-05-07T20:31:58.4789990Z int4_kv=True, 2025-05-07T20:31:58.4790200Z num_groups=1, 2025-05-07T20:31:58.4790407Z B=43, 2025-05-07T20:31:58.4790589Z MAX_T=43, 2025-05-07T20:31:58.4790787Z N_H_L=43, 2025-05-07T20:31:58.4790978Z ) 2025-05-07T20:31:58.4791164Z Trying example: test_gqa( 2025-05-07T20:31:58.4791450Z self=, 2025-05-07T20:31:58.4791760Z int4_kv=False, 2025-05-07T20:31:58.4791964Z num_groups=1, 2025-05-07T20:31:58.4792169Z B=21, 2025-05-07T20:31:58.4792356Z MAX_T=38, 2025-05-07T20:31:58.4792601Z N_H_L=42, 2025-05-07T20:31:58.4792792Z ) 2025-05-07T20:31:58.4792988Z Trying example: test_gqa( 2025-05-07T20:31:58.4793270Z self=, 2025-05-07T20:31:58.4793582Z int4_kv=False, 2025-05-07T20:31:58.4793794Z num_groups=1, 2025-05-07T20:31:58.4793996Z B=38, 2025-05-07T20:31:58.4794184Z MAX_T=38, 2025-05-07T20:31:58.4794387Z N_H_L=42, 2025-05-07T20:31:58.4794571Z ) 2025-05-07T20:31:58.4794766Z Trying example: test_gqa( 2025-05-07T20:31:58.4795059Z self=, 2025-05-07T20:31:58.4795401Z int4_kv=False, 2025-05-07T20:31:58.4795627Z num_groups=1, 2025-05-07T20:31:58.4795937Z B=38, 2025-05-07T20:31:58.4796125Z MAX_T=42, 2025-05-07T20:31:58.4796314Z N_H_L=42, 2025-05-07T20:31:58.4796505Z ) 2025-05-07T20:31:58.4796710Z Trying example: test_gqa( 2025-05-07T20:31:58.4796994Z self=, 2025-05-07T20:31:58.4797306Z int4_kv=False, 2025-05-07T20:31:58.4797520Z num_groups=1, 2025-05-07T20:31:58.4797719Z B=42, 2025-05-07T20:31:58.4798015Z MAX_T=42, 2025-05-07T20:31:58.4798215Z N_H_L=42, 2025-05-07T20:31:58.4798406Z ) 2025-05-07T20:31:58.4798603Z Trying example: test_gqa( 2025-05-07T20:31:58.4798899Z self=, 2025-05-07T20:31:58.4799283Z int4_kv=True, 2025-05-07T20:31:58.4799501Z num_groups=1, 2025-05-07T20:31:58.4799716Z B=74, 2025-05-07T20:31:58.4799903Z MAX_T=20, 2025-05-07T20:31:58.4800109Z N_H_L=15, 2025-05-07T20:31:58.4800310Z ) 2025-05-07T20:31:58.4800501Z Trying example: test_gqa( 2025-05-07T20:31:58.4800795Z self=, 2025-05-07T20:31:58.4801117Z int4_kv=True, 2025-05-07T20:31:58.4801321Z num_groups=1, 2025-05-07T20:31:58.4801534Z B=20, 2025-05-07T20:31:58.4801727Z MAX_T=20, 
2025-05-07T20:31:58.4801921Z N_H_L=15, 2025-05-07T20:31:58.4802114Z ) 2025-05-07T20:31:58.4802309Z Trying example: test_gqa( 2025-05-07T20:31:58.4802598Z self=, 2025-05-07T20:31:58.4802917Z int4_kv=True, 2025-05-07T20:31:58.4803128Z num_groups=1, 2025-05-07T20:31:58.4803341Z B=20, 2025-05-07T20:31:58.4803525Z MAX_T=15, 2025-05-07T20:31:58.4803717Z N_H_L=15, 2025-05-07T20:31:58.4803919Z ) 2025-05-07T20:31:58.4804109Z Trying example: test_gqa( 2025-05-07T20:31:58.4804397Z self=, 2025-05-07T20:31:58.4804708Z int4_kv=True, 2025-05-07T20:31:58.4804913Z num_groups=1, 2025-05-07T20:31:58.4805119Z B=15, 2025-05-07T20:31:58.4805307Z MAX_T=20, 2025-05-07T20:31:58.4805500Z N_H_L=15, 2025-05-07T20:31:58.4805697Z ) 2025-05-07T20:31:58.4805894Z Trying example: test_gqa( 2025-05-07T20:31:58.4806179Z self=, 2025-05-07T20:31:58.4806493Z int4_kv=True, 2025-05-07T20:31:58.4806703Z num_groups=1, 2025-05-07T20:31:58.4806900Z B=15, 2025-05-07T20:31:58.4807091Z MAX_T=15, 2025-05-07T20:31:58.4807286Z N_H_L=15, 2025-05-07T20:31:58.4807475Z ) 2025-05-07T20:31:58.4807675Z Trying example: test_gqa( 2025-05-07T20:31:58.4807974Z self=, 2025-05-07T20:31:58.4808281Z int4_kv=False, 2025-05-07T20:31:58.4808507Z num_groups=4, 2025-05-07T20:31:58.4808723Z B=117, 2025-05-07T20:31:58.4808909Z MAX_T=104, 2025-05-07T20:31:58.4809113Z N_H_L=69, 2025-05-07T20:31:58.4809316Z ) 2025-05-07T20:31:58.4809507Z Trying example: test_gqa( 2025-05-07T20:31:58.4809801Z self=, 2025-05-07T20:31:58.4810120Z int4_kv=False, 2025-05-07T20:31:58.4810339Z num_groups=4, 2025-05-07T20:31:58.4810546Z B=117, 2025-05-07T20:31:58.4810750Z MAX_T=117, 2025-05-07T20:31:58.4810951Z N_H_L=69, 2025-05-07T20:31:58.4811145Z ) 2025-05-07T20:31:58.4811347Z Trying example: test_gqa( 2025-05-07T20:31:58.4811636Z self=, 2025-05-07T20:31:58.4811946Z int4_kv=False, 2025-05-07T20:31:58.4812163Z num_groups=4, 2025-05-07T20:31:58.4812377Z B=69, 2025-05-07T20:31:58.4812565Z MAX_T=117, 2025-05-07T20:31:58.4812774Z N_H_L=69, 2025-05-07T20:31:58.4812969Z ) 2025-05-07T20:31:58.4813164Z Trying example: test_gqa( 2025-05-07T20:31:58.4813458Z self=, 2025-05-07T20:31:58.4813768Z int4_kv=False, 2025-05-07T20:31:58.4813979Z num_groups=4, 2025-05-07T20:31:58.4814192Z B=117, 2025-05-07T20:31:58.4814387Z MAX_T=69, 2025-05-07T20:31:58.4814588Z N_H_L=69, 2025-05-07T20:31:58.4814795Z ) 2025-05-07T20:31:58.4814988Z PASSED 2025-05-07T20:31:58.4917442Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
2025-05-07T20:31:58.4917782Z 2025-05-07T20:31:58.4917937Z =========================== short test summary info ============================ 2025-05-07T20:31:58.4918665Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:58.4919554Z ======================== 1 passed, 1 skipped in 37.76s ========================= 2025-05-07T20:31:59.1534537Z 2025-05-07T20:31:59.1535118Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:59.1556195Z [TEST] Python test time for ./attention/gqa_test.py: 41 seconds 2025-05-07T20:31:59.1556490Z 2025-05-07T20:31:59.1556495Z 2025-05-07T20:31:59.1556500Z 2025-05-07T20:31:59.1556503Z 2025-05-07T20:31:59.1578711Z ################################################################################ 2025-05-07T20:31:59.1594411Z # [2025-05-07T20:31:59.159Z] Run Python Test Suite: 2025-05-07T20:31:59.1594763Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:59.1595060Z ################################################################################ 2025-05-07T20:31:59.1619578Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:59.1620202Z 2025-05-07T20:32:01.3213136Z ============================= test session starts ============================== 2025-05-07T20:32:01.3213779Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:01.3214307Z cachedir: .pytest_cache 2025-05-07T20:32:01.3214887Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:01.3215615Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:01.3216030Z plugins: hypothesis-6.131.14 2025-05-07T20:32:03.0587836Z collecting ... 
collected 1 item 2025-05-07T20:32:03.8167520Z 2025-05-07T20:32:03.8167813Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:32:03.8168137Z 2025-05-07T20:32:03.8168415Z ============================== 1 passed in 2.62s =============================== 2025-05-07T20:32:04.4535417Z 2025-05-07T20:32:04.4536148Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:32:04.4553687Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:32:04.4554081Z 2025-05-07T20:32:04.4554087Z 2025-05-07T20:32:04.4554109Z 2025-05-07T20:32:04.4554114Z 2025-05-07T20:32:04.4576521Z ################################################################################ 2025-05-07T20:32:04.4593551Z # [2025-05-07T20:32:04.459Z] Run Python Test Suite: 2025-05-07T20:32:04.4593903Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.4594198Z ################################################################################ 2025-05-07T20:32:04.4619630Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.4620247Z 2025-05-07T20:32:06.6328558Z ============================= test session starts ============================== 2025-05-07T20:32:06.6329236Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:06.6329761Z cachedir: .pytest_cache 2025-05-07T20:32:06.6330338Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:06.6331088Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:06.6331490Z plugins: hypothesis-6.131.14 2025-05-07T20:32:08.3337235Z collecting ... 
collected 5 items 2025-05-07T20:32:08.3337656Z 2025-05-07T20:32:08.3350508Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:32:08.3360237Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:32:08.3369112Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:32:08.3377432Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:32:08.3397452Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:32:08.3397791Z 2025-05-07T20:32:08.3397945Z =========================== short test summary info ============================ 2025-05-07T20:32:08.3398617Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3399710Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3400633Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3401555Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3402479Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.3403121Z ============================== 5 skipped in 1.83s ============================== 2025-05-07T20:32:08.9158472Z 2025-05-07T20:32:08.9159116Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:32:08.9179049Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:32:08.9179342Z 2025-05-07T20:32:08.9179347Z 2025-05-07T20:32:08.9179350Z 2025-05-07T20:32:08.9179354Z 2025-05-07T20:32:08.9199582Z ################################################################################ 2025-05-07T20:32:08.9217263Z # [2025-05-07T20:32:08.921Z] Run Python Test Suite: 2025-05-07T20:32:08.9217622Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:08.9217945Z ################################################################################ 2025-05-07T20:32:08.9242926Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:08.9243589Z 2025-05-07T20:32:11.0750946Z ============================= test session starts ============================== 2025-05-07T20:32:11.0751831Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.0752353Z cachedir: .pytest_cache 2025-05-07T20:32:11.0752922Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.0753649Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.0754060Z plugins: hypothesis-6.131.14 2025-05-07T20:32:12.8720152Z collecting ... 
collected 2 items 2025-05-07T20:32:12.8720365Z 2025-05-07T20:32:12.8731820Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:32:12.8748989Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:32:12.8749421Z 2025-05-07T20:32:12.8749580Z =========================== short test summary info ============================ 2025-05-07T20:32:12.8750201Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:12.8751042Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:12.8751642Z ============================== 2 skipped in 1.92s ============================== 2025-05-07T20:32:13.4652067Z 2025-05-07T20:32:13.4652554Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:13.4672625Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:32:13.4672951Z 2025-05-07T20:32:13.4672955Z 2025-05-07T20:32:13.4672959Z 2025-05-07T20:32:13.4673029Z 2025-05-07T20:32:13.4695183Z ################################################################################ 2025-05-07T20:32:13.4710397Z # [2025-05-07T20:32:13.470Z] Run Python Test Suite: 2025-05-07T20:32:13.4710735Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.4711032Z ################################################################################ 2025-05-07T20:32:13.4735502Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.4736119Z 2025-05-07T20:32:15.6273008Z ============================= test session starts ============================== 2025-05-07T20:32:15.6273653Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:15.6274177Z cachedir: .pytest_cache 2025-05-07T20:32:15.6274746Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:15.6275498Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:15.6275967Z plugins: hypothesis-6.131.14 2025-05-07T20:32:17.3193626Z collecting ... collected 4 items 2025-05-07T20:32:17.3194033Z 2025-05-07T20:32:20.0693087Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
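All of the SKIPPED results in this run are hardware gates, not failures: the comm tests require at least two GPUs, the gather_scatter tests require a Hopper-class GPU, and the fp8 kv_cache test requires an H100 or MI300, as the skip reasons in each short test summary state. A minimal sketch of that skip pattern follows; running_on_hopper() is a hypothetical helper, while the decorator messages are copied from the log.

    # Sketch only: running_on_hopper() is a hypothetical helper, not FBGEMM code.
    import unittest

    import torch


    def running_on_hopper() -> bool:
        # Hopper (H100) reports CUDA compute capability (9, 0).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() == (9, 0)


    class GuardedTests(unittest.TestCase):
        @unittest.skipIf(
            not torch.cuda.is_available() or torch.cuda.device_count() < 2,
            "Skip when CUDA is not available or when there are not enough GPUs; "
            "these tests require at least two GPUs",
        )
        def test_needs_two_gpus(self) -> None: ...

        @unittest.skipIf(
            not running_on_hopper(),
            "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
        )
        def test_needs_hopper(self) -> None: ...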
2025-05-07T20:32:20.0776540Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:32:20.0872586Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:32:20.0962470Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:32:20.0962838Z 2025-05-07T20:32:20.0962988Z =========================== short test summary info ============================ 2025-05-07T20:32:20.0963696Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:32:20.0964634Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:32:20.0965249Z ============================== 4 skipped in 4.59s ============================== 2025-05-07T20:32:22.0299219Z 2025-05-07T20:32:22.0299711Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:32:22.0319393Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:32:22.0319689Z 2025-05-07T20:32:22.0319926Z 2025-05-07T20:32:22.0319929Z 2025-05-07T20:32:22.0319989Z 2025-05-07T20:32:22.0341579Z ################################################################################ 2025-05-07T20:32:22.0356695Z # [2025-05-07T20:32:22.035Z] Run Python Test Suite: 2025-05-07T20:32:22.0357067Z # ./moe/activation_test.py 2025-05-07T20:32:22.0357380Z ################################################################################ 2025-05-07T20:32:22.0381974Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:32:22.0382806Z 2025-05-07T20:32:24.1911920Z ============================= test session starts ============================== 2025-05-07T20:32:24.1912582Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:24.1922032Z cachedir: .pytest_cache 2025-05-07T20:32:24.1922679Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:24.1923417Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:24.1923835Z plugins: hypothesis-6.131.14 2025-05-07T20:32:25.8318015Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.9388843Z collecting ... 
collected 2 items 2025-05-07T20:32:25.9389043Z 2025-05-07T20:32:31.2858900Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:31.2860121Z self=, 2025-05-07T20:32:31.2861236Z T=1, 2025-05-07T20:32:31.2861634Z D=5120, 2025-05-07T20:32:31.2862414Z contiguous=True, 2025-05-07T20:32:31.2862763Z compiled=True, 2025-05-07T20:32:31.2863053Z ) 2025-05-07T20:32:31.2863334Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2863843Z self=, 2025-05-07T20:32:31.2864226Z T=4096, 2025-05-07T20:32:31.2864423Z D=5120, 2025-05-07T20:32:31.2864626Z contiguous=True, 2025-05-07T20:32:31.2864850Z compiled=True, 2025-05-07T20:32:31.2865057Z ) 2025-05-07T20:32:31.2865262Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2865929Z self=, 2025-05-07T20:32:31.2866321Z T=4096, 2025-05-07T20:32:31.2866519Z D=7168, 2025-05-07T20:32:31.2866718Z contiguous=False, 2025-05-07T20:32:31.2866966Z compiled=False, 2025-05-07T20:32:31.2867179Z ) 2025-05-07T20:32:31.2867375Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2867755Z self=, 2025-05-07T20:32:31.2868152Z T=4096, 2025-05-07T20:32:31.2868338Z D=5120, 2025-05-07T20:32:31.2868544Z contiguous=False, 2025-05-07T20:32:31.2868779Z compiled=True, 2025-05-07T20:32:31.2868989Z ) 2025-05-07T20:32:31.2869190Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2869562Z self=, 2025-05-07T20:32:31.2869944Z T=1, 2025-05-07T20:32:31.2870131Z D=7168, 2025-05-07T20:32:31.2870337Z contiguous=True, 2025-05-07T20:32:31.2870563Z compiled=True, 2025-05-07T20:32:31.2870773Z ) 2025-05-07T20:32:31.2870980Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2871356Z self=, 2025-05-07T20:32:31.2871734Z T=1, 2025-05-07T20:32:31.2871932Z D=7168, 2025-05-07T20:32:31.2872142Z contiguous=False, 2025-05-07T20:32:31.2872372Z compiled=True, 2025-05-07T20:32:31.2872586Z ) 2025-05-07T20:32:31.2872793Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2873173Z self=, 2025-05-07T20:32:31.2873566Z T=4096, 2025-05-07T20:32:31.2873769Z D=5120, 2025-05-07T20:32:31.2873971Z contiguous=False, 2025-05-07T20:32:31.2874214Z compiled=False, 2025-05-07T20:32:31.2874431Z ) 2025-05-07T20:32:31.2874633Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2875015Z self=, 2025-05-07T20:32:31.2875403Z T=1, 2025-05-07T20:32:31.2875597Z D=7168, 2025-05-07T20:32:31.2875891Z contiguous=True, 2025-05-07T20:32:31.2876127Z compiled=False, 2025-05-07T20:32:31.2876341Z ) 2025-05-07T20:32:31.2876539Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2876924Z self=, 2025-05-07T20:32:31.2877310Z T=2048, 2025-05-07T20:32:31.2877498Z D=5120, 2025-05-07T20:32:31.2877697Z contiguous=True, 2025-05-07T20:32:31.2877924Z compiled=True, 2025-05-07T20:32:31.2878132Z ) 2025-05-07T20:32:31.2878339Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2878714Z self=, 2025-05-07T20:32:31.2879089Z T=2048, 2025-05-07T20:32:31.2879283Z D=7168, 2025-05-07T20:32:31.2879486Z contiguous=True, 2025-05-07T20:32:31.2879707Z compiled=True, 2025-05-07T20:32:31.2879917Z ) 2025-05-07T20:32:31.2880119Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2880481Z self=, 2025-05-07T20:32:31.2880873Z T=2048, 2025-05-07T20:32:31.2881067Z D=7168, 2025-05-07T20:32:31.2881260Z contiguous=True, 2025-05-07T20:32:31.2881491Z compiled=False, 2025-05-07T20:32:31.2881702Z ) 2025-05-07T20:32:31.2882067Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2882439Z self=, 2025-05-07T20:32:31.2882823Z T=128, 2025-05-07T20:32:31.2883021Z D=5120, 2025-05-07T20:32:31.2883338Z contiguous=False, 2025-05-07T20:32:31.2883571Z 
compiled=True, 2025-05-07T20:32:31.2883782Z ) 2025-05-07T20:32:31.2883980Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2884358Z self=, 2025-05-07T20:32:31.2884750Z T=128, 2025-05-07T20:32:31.2884941Z D=5120, 2025-05-07T20:32:31.2885147Z contiguous=True, 2025-05-07T20:32:31.2885378Z compiled=True, 2025-05-07T20:32:31.2885583Z ) 2025-05-07T20:32:31.2885787Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2886166Z self=, 2025-05-07T20:32:31.2886546Z T=16384, 2025-05-07T20:32:31.2886749Z D=5120, 2025-05-07T20:32:31.2886988Z contiguous=False, 2025-05-07T20:32:31.2887232Z compiled=True, 2025-05-07T20:32:31.2887440Z ) 2025-05-07T20:32:31.2887641Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2888016Z self=, 2025-05-07T20:32:31.2888408Z T=16384, 2025-05-07T20:32:31.2888601Z D=5120, 2025-05-07T20:32:31.2888804Z contiguous=False, 2025-05-07T20:32:31.2889041Z compiled=False, 2025-05-07T20:32:31.2889255Z ) 2025-05-07T20:32:31.2889458Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2889825Z self=, 2025-05-07T20:32:31.2890202Z T=128, 2025-05-07T20:32:31.2890392Z D=7168, 2025-05-07T20:32:31.2890589Z contiguous=True, 2025-05-07T20:32:31.2890814Z compiled=False, 2025-05-07T20:32:31.2891024Z ) 2025-05-07T20:32:31.2891225Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2891604Z self=, 2025-05-07T20:32:31.2891981Z T=128, 2025-05-07T20:32:31.2892180Z D=7168, 2025-05-07T20:32:31.2892397Z contiguous=False, 2025-05-07T20:32:31.2892645Z compiled=False, 2025-05-07T20:32:31.2892854Z ) 2025-05-07T20:32:31.2893045Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2893419Z self=, 2025-05-07T20:32:31.2893799Z T=1, 2025-05-07T20:32:31.2893981Z D=5120, 2025-05-07T20:32:31.2894179Z contiguous=False, 2025-05-07T20:32:31.2894406Z compiled=False, 2025-05-07T20:32:31.2894607Z ) 2025-05-07T20:32:31.2894807Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2895176Z self=, 2025-05-07T20:32:31.2895546Z T=1, 2025-05-07T20:32:31.2895732Z D=7168, 2025-05-07T20:32:31.2895930Z contiguous=False, 2025-05-07T20:32:31.2896149Z compiled=False, 2025-05-07T20:32:31.2896356Z ) 2025-05-07T20:32:31.2896556Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2896923Z self=, 2025-05-07T20:32:31.2897304Z T=4096, 2025-05-07T20:32:31.2897494Z D=5120, 2025-05-07T20:32:31.2897694Z contiguous=True, 2025-05-07T20:32:31.2897920Z compiled=False, 2025-05-07T20:32:31.2898127Z ) 2025-05-07T20:32:31.2898330Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2898700Z self=, 2025-05-07T20:32:31.2899079Z T=128, 2025-05-07T20:32:31.2899272Z D=7168, 2025-05-07T20:32:31.2899464Z contiguous=True, 2025-05-07T20:32:31.2899689Z compiled=True, 2025-05-07T20:32:31.2899897Z ) 2025-05-07T20:32:31.2900094Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2900465Z self=, 2025-05-07T20:32:31.2900847Z T=1, 2025-05-07T20:32:31.2901027Z D=5120, 2025-05-07T20:32:31.2901238Z contiguous=False, 2025-05-07T20:32:31.2901473Z compiled=True, 2025-05-07T20:32:31.2901674Z ) 2025-05-07T20:32:31.2901975Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2902352Z self=, 2025-05-07T20:32:31.2902724Z T=4096, 2025-05-07T20:32:31.2903015Z D=7168, 2025-05-07T20:32:31.2903214Z contiguous=True, 2025-05-07T20:32:31.2903431Z compiled=False, 2025-05-07T20:32:31.2903641Z ) 2025-05-07T20:32:31.2903838Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2904208Z self=, 2025-05-07T20:32:31.2904586Z T=4096, 2025-05-07T20:32:31.2904775Z D=7168, 2025-05-07T20:32:31.2904971Z contiguous=False, 2025-05-07T20:32:31.2905188Z compiled=True, 2025-05-07T20:32:31.2905392Z ) 
2025-05-07T20:32:31.2905590Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2905952Z self=, 2025-05-07T20:32:31.2906327Z T=128, 2025-05-07T20:32:31.2906515Z D=5120, 2025-05-07T20:32:31.2906703Z contiguous=True, 2025-05-07T20:32:31.2906933Z compiled=False, 2025-05-07T20:32:31.2907139Z ) 2025-05-07T20:32:31.2907332Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2907701Z self=, 2025-05-07T20:32:31.2908082Z T=128, 2025-05-07T20:32:31.2908265Z D=5120, 2025-05-07T20:32:31.2908461Z contiguous=False, 2025-05-07T20:32:31.2908685Z compiled=False, 2025-05-07T20:32:31.2908891Z ) 2025-05-07T20:32:31.2909090Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2909458Z self=, 2025-05-07T20:32:31.2909828Z T=1, 2025-05-07T20:32:31.2910014Z D=5120, 2025-05-07T20:32:31.2910211Z contiguous=True, 2025-05-07T20:32:31.2910435Z compiled=False, 2025-05-07T20:32:31.2910636Z ) 2025-05-07T20:32:31.2910837Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2911207Z self=, 2025-05-07T20:32:31.2911577Z T=2048, 2025-05-07T20:32:31.2911774Z D=7168, 2025-05-07T20:32:31.2911971Z contiguous=False, 2025-05-07T20:32:31.2912193Z compiled=True, 2025-05-07T20:32:31.2912399Z ) 2025-05-07T20:32:31.2912603Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2912971Z self=, 2025-05-07T20:32:31.2913350Z T=2048, 2025-05-07T20:32:31.2913540Z D=7168, 2025-05-07T20:32:31.2913735Z contiguous=False, 2025-05-07T20:32:31.2913962Z compiled=False, 2025-05-07T20:32:31.2914173Z ) 2025-05-07T20:32:31.2914372Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2914745Z self=, 2025-05-07T20:32:31.2915122Z T=16384, 2025-05-07T20:32:31.2915310Z D=7168, 2025-05-07T20:32:31.2915512Z contiguous=False, 2025-05-07T20:32:31.2915803Z compiled=True, 2025-05-07T20:32:31.2916003Z ) 2025-05-07T20:32:31.2916204Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2916575Z self=, 2025-05-07T20:32:31.2916952Z T=16384, 2025-05-07T20:32:31.2917140Z D=7168, 2025-05-07T20:32:31.2917335Z contiguous=True, 2025-05-07T20:32:31.2917567Z compiled=True, 2025-05-07T20:32:31.2917765Z ) 2025-05-07T20:32:31.2917964Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2918332Z self=, 2025-05-07T20:32:31.2918703Z T=4096, 2025-05-07T20:32:31.2918890Z D=7168, 2025-05-07T20:32:31.2919091Z contiguous=True, 2025-05-07T20:32:31.2919307Z compiled=True, 2025-05-07T20:32:31.2919511Z ) 2025-05-07T20:32:31.2919713Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2920075Z self=, 2025-05-07T20:32:31.2920451Z T=2048, 2025-05-07T20:32:31.2920639Z D=5120, 2025-05-07T20:32:31.2920837Z contiguous=False, 2025-05-07T20:32:31.2921064Z compiled=False, 2025-05-07T20:32:31.2921277Z ) 2025-05-07T20:32:31.2921572Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2921944Z self=, 2025-05-07T20:32:31.2922368Z T=2048, 2025-05-07T20:32:31.2922674Z D=5120, 2025-05-07T20:32:31.2922863Z contiguous=True, 2025-05-07T20:32:31.2923087Z compiled=False, 2025-05-07T20:32:31.2923295Z ) 2025-05-07T20:32:31.2923488Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2923866Z self=, 2025-05-07T20:32:31.2924244Z T=128, 2025-05-07T20:32:31.2924427Z D=7168, 2025-05-07T20:32:31.2924628Z contiguous=False, 2025-05-07T20:32:31.2924855Z compiled=True, 2025-05-07T20:32:31.2925052Z ) 2025-05-07T20:32:31.2925253Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2925626Z self=, 2025-05-07T20:32:31.2925999Z T=16384, 2025-05-07T20:32:31.2926200Z D=5120, 2025-05-07T20:32:31.2926404Z contiguous=True, 2025-05-07T20:32:31.2926622Z compiled=True, 2025-05-07T20:32:31.2926828Z ) 2025-05-07T20:32:31.2927033Z Trying example: 
test_silu_mul( 2025-05-07T20:32:31.2927397Z self=, 2025-05-07T20:32:31.2927779Z T=2048, 2025-05-07T20:32:31.2927972Z D=5120, 2025-05-07T20:32:31.2928166Z contiguous=False, 2025-05-07T20:32:31.2928394Z compiled=True, 2025-05-07T20:32:31.2928597Z ) 2025-05-07T20:32:31.2928794Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2929165Z self=, 2025-05-07T20:32:31.2929542Z T=16384, 2025-05-07T20:32:31.2929743Z D=5120, 2025-05-07T20:32:31.2929936Z contiguous=True, 2025-05-07T20:32:31.2930160Z compiled=False, 2025-05-07T20:32:31.2930369Z ) 2025-05-07T20:32:31.2930560Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2930928Z self=, 2025-05-07T20:32:31.2931310Z T=16384, 2025-05-07T20:32:31.2931501Z D=7168, 2025-05-07T20:32:31.2931702Z contiguous=False, 2025-05-07T20:32:31.2931929Z compiled=False, 2025-05-07T20:32:31.2932130Z ) 2025-05-07T20:32:31.2932341Z Trying example: test_silu_mul( 2025-05-07T20:32:31.2932709Z self=, 2025-05-07T20:32:31.2933080Z T=16384, 2025-05-07T20:32:31.2933279Z D=7168, 2025-05-07T20:32:31.2933484Z contiguous=True, 2025-05-07T20:32:31.2933702Z compiled=False, 2025-05-07T20:32:31.2933910Z ) 2025-05-07T20:32:31.2934092Z PASSED 2025-05-07T20:32:31.3518374Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:31.3519676Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:32:31.3521043Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:31.3522502Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:31.3523483Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.3524784Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:31.3526498Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.3527491Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.3528884Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:31.3530260Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.3531327Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:32:31.3532614Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:31.3533861Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:32:31.3535077Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:31.3536283Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:32:31.3537108Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.3538133Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:31.3539148Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:32:31.3539940Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:31.3541144Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:31.3542427Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:31.3543542Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:31.3544581Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:32:31.3545770Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:31.3547133Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:31.3548199Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.3549196Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.3549937Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:32:31.3551033Z W0507 20:32:31.349000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8313503Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8314418Z self=, 2025-05-07T20:32:31.8314867Z T=1, 2025-05-07T20:32:31.8315064Z D=5120, 2025-05-07T20:32:31.8315259Z scale_ub=None, 2025-05-07T20:32:31.8315481Z contiguous=True, 2025-05-07T20:32:31.8315796Z compiled=True, 2025-05-07T20:32:31.8316006Z ) 2025-05-07T20:32:31.8316328Z self = 2025-05-07T20:32:31.8316818Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:31.8317078Z 2025-05-07T20:32:31.8317161Z @given( 2025-05-07T20:32:31.8317403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.8317720Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.8318022Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.8318374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.8318707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.8319002Z ) 2025-05-07T20:32:31.8319358Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.8319806Z def test_silu_mul_quant( 2025-05-07T20:32:31.8320057Z self, 2025-05-07T20:32:31.8320253Z T: int, 2025-05-07T20:32:31.8320464Z D: int, 2025-05-07T20:32:31.8320700Z scale_ub: Optional[float], 2025-05-07T20:32:31.8320976Z contiguous: bool, 2025-05-07T20:32:31.8321229Z compiled: bool, 2025-05-07T20:32:31.8321467Z ) -> None: 2025-05-07T20:32:31.8321689Z torch.manual_seed(2025) 2025-05-07T20:32:31.8321948Z 2025-05-07T20:32:31.8322238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.8322584Z 2025-05-07T20:32:31.8322791Z x_sign = torch.sign(x) 2025-05-07T20:32:31.8323422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.8323751Z x = x_sign * x_clamp 2025-05-07T20:32:31.8323995Z x0 = x[:, :D] 2025-05-07T20:32:31.8324221Z x1 = x[:, D:] 2025-05-07T20:32:31.8324594Z 2025-05-07T20:32:31.8324780Z if contiguous: 2025-05-07T20:32:31.8325022Z x0 = x0.contiguous() 2025-05-07T20:32:31.8325289Z x1 = x1.contiguous() 2025-05-07T20:32:31.8325525Z 2025-05-07T20:32:31.8325724Z if scale_ub is not None: 2025-05-07T20:32:31.8326004Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.8326341Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.8326659Z ) 2025-05-07T20:32:31.8326865Z else: 2025-05-07T20:32:31.8327074Z scale_ub_tensor = None 2025-05-07T20:32:31.8327332Z 2025-05-07T20:32:31.8327569Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8327887Z op = silu_mul_quant 2025-05-07T20:32:31.8328152Z if compiled: 2025-05-07T20:32:31.8328406Z op = torch.compile(op) 2025-05-07T20:32:31.8328707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.8328989Z 2025-05-07T20:32:31.8329186Z y_fp8, y_scale = fn() 2025-05-07T20:32:31.8329474Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:31.8329761Z 2025-05-07T20:32:31.8330002Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.8330341Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:31.8330629Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:31.8330945Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:31.8331309Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8331619Z 2025-05-07T20:32:31.8331828Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:31.8332023Z 2025-05-07T20:32:31.8332133Z moe/activation_test.py:126: 2025-05-07T20:32:31.8332448Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8332823Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.8333161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.8333960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.8334723Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.8335272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.8335956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.8336650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.8337377Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.8338118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.8338766Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.8339381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.8339903Z fn() 2025-05-07T20:32:31.8340409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.8340994Z self.fn.run( 2025-05-07T20:32:31.8341467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.8342001Z kernel = self.compile( 2025-05-07T20:32:31.8342548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.8343300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.8343707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.8343939Z 2025-05-07T20:32:31.8344150Z self = 2025-05-07T20:32:31.8345315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.8346808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04acc360>} 2025-05-07T20:32:31.8348155Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.8349186Z context = 2025-05-07T20:32:31.8349474Z 2025-05-07T20:32:31.8349641Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.8350164Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.8350643Z module_map=module_map) 2025-05-07T20:32:31.8351013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8351367Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.8351638Z E ^ 2025-05-07T20:32:31.8352105Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8352557Z 2025-05-07T20:32:31.8352979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.8353489Z 2025-05-07T20:32:31.8353594Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8354016Z self=, 2025-05-07T20:32:31.8354421Z T=2048, 2025-05-07T20:32:31.8354609Z D=5120, 2025-05-07T20:32:31.8354811Z scale_ub=1200.0, 2025-05-07T20:32:31.8355041Z contiguous=True, 2025-05-07T20:32:31.8355261Z compiled=False, 2025-05-07T20:32:31.8355469Z ) 2025-05-07T20:32:32.1222760Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:32.1224023Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:32.1225372Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:32.1226839Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:32.1227844Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1229154Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:32.1230534Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1231827Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1233067Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:32.1234603Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1235767Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1237056Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:32.1238308Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:32.1239534Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:32.1240755Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:32.1241589Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1242609Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:32.1243681Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:32.1244478Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:32.1245696Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:32.1246984Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:32.1248094Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:32.1249139Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:32.1250318Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:32.1251687Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:32.1252768Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1253701Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.1254446Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:32.1255553Z W0507 20:32:32.118000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7858292Z self = 2025-05-07T20:32:32.7858925Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.7859324Z 2025-05-07T20:32:32.7859477Z @given( 2025-05-07T20:32:32.7859717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7860038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7860362Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7860699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7861020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7861311Z ) 2025-05-07T20:32:32.7861661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7862105Z def test_silu_mul_quant( 2025-05-07T20:32:32.7862354Z self, 2025-05-07T20:32:32.7862555Z T: int, 2025-05-07T20:32:32.7862749Z D: int, 2025-05-07T20:32:32.7862971Z scale_ub: Optional[float], 2025-05-07T20:32:32.7863249Z contiguous: bool, 2025-05-07T20:32:32.7863483Z compiled: bool, 2025-05-07T20:32:32.7863717Z ) -> None: 2025-05-07T20:32:32.7863945Z torch.manual_seed(2025) 2025-05-07T20:32:32.7864183Z 2025-05-07T20:32:32.7864463Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7864818Z 2025-05-07T20:32:32.7865026Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7865312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7865883Z x = x_sign * x_clamp 2025-05-07T20:32:32.7866127Z x0 = x[:, :D] 2025-05-07T20:32:32.7866343Z x1 = x[:, D:] 2025-05-07T20:32:32.7866556Z 2025-05-07T20:32:32.7866749Z if contiguous: 2025-05-07T20:32:32.7866978Z x0 = x0.contiguous() 2025-05-07T20:32:32.7867248Z x1 = x1.contiguous() 2025-05-07T20:32:32.7867497Z 2025-05-07T20:32:32.7867691Z if scale_ub is not None: 2025-05-07T20:32:32.7867975Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7868319Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7868628Z ) 2025-05-07T20:32:32.7869161Z else: 2025-05-07T20:32:32.7869385Z scale_ub_tensor = None 2025-05-07T20:32:32.7869639Z 2025-05-07T20:32:32.7869884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7870405Z op = silu_mul_quant 2025-05-07T20:32:32.7870664Z if compiled: 2025-05-07T20:32:32.7870916Z op = torch.compile(op) 2025-05-07T20:32:32.7871216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7871496Z 2025-05-07T20:32:32.7871689Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7871859Z 2025-05-07T20:32:32.7871962Z moe/activation_test.py:117: 2025-05-07T20:32:32.7872265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7872599Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7872889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7873591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7874282Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7874812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7875499Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7876261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7876788Z kernel = self.compile( 2025-05-07T20:32:32.7877326Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7877983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7878382Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7878614Z 2025-05-07T20:32:32.7878824Z self = 2025-05-07T20:32:32.7879907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7881301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c054f1e40>} 2025-05-07T20:32:32.7882643Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7883722Z context = 2025-05-07T20:32:32.7884009Z 2025-05-07T20:32:32.7884176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7884703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7885195Z module_map=module_map) 2025-05-07T20:32:32.7885567Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7885922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7886198Z E ^ 2025-05-07T20:32:32.7886665Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7887116Z 2025-05-07T20:32:32.7887539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7888047Z 2025-05-07T20:32:32.7888154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7888576Z self=, 2025-05-07T20:32:32.7888995Z T=2048, 2025-05-07T20:32:32.7889187Z D=5120, 2025-05-07T20:32:32.7889394Z scale_ub=1200.0, 2025-05-07T20:32:32.7889717Z contiguous=True, 2025-05-07T20:32:32.7889954Z compiled=True, 2025-05-07T20:32:32.7890166Z ) 2025-05-07T20:32:32.7890489Z self = 2025-05-07T20:32:32.7891070Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.7891340Z 2025-05-07T20:32:32.7891422Z @given( 2025-05-07T20:32:32.7891664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7891981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7892286Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7892668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7901409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7901749Z ) 2025-05-07T20:32:32.7902113Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7902567Z def test_silu_mul_quant( 2025-05-07T20:32:32.7902830Z self, 2025-05-07T20:32:32.7903035Z T: int, 2025-05-07T20:32:32.7903270Z D: int, 2025-05-07T20:32:32.7903490Z scale_ub: Optional[float], 2025-05-07T20:32:32.7903772Z contiguous: bool, 2025-05-07T20:32:32.7904031Z compiled: bool, 2025-05-07T20:32:32.7904257Z ) -> None: 2025-05-07T20:32:32.7904486Z torch.manual_seed(2025) 2025-05-07T20:32:32.7904743Z 2025-05-07T20:32:32.7905024Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7905386Z 2025-05-07T20:32:32.7905592Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7905891Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7906203Z x = x_sign * x_clamp 2025-05-07T20:32:32.7906454Z x0 = x[:, :D] 
2025-05-07T20:32:32.7906682Z x1 = x[:, D:] 2025-05-07T20:32:32.7906890Z 2025-05-07T20:32:32.7907091Z if contiguous: 2025-05-07T20:32:32.7907334Z x0 = x0.contiguous() 2025-05-07T20:32:32.7907602Z x1 = x1.contiguous() 2025-05-07T20:32:32.7907862Z 2025-05-07T20:32:32.7908060Z if scale_ub is not None: 2025-05-07T20:32:32.7908337Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7908691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7909014Z ) 2025-05-07T20:32:32.7909209Z else: 2025-05-07T20:32:32.7909430Z scale_ub_tensor = None 2025-05-07T20:32:32.7909689Z 2025-05-07T20:32:32.7909924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7910254Z op = silu_mul_quant 2025-05-07T20:32:32.7910521Z if compiled: 2025-05-07T20:32:32.7910785Z op = torch.compile(op) 2025-05-07T20:32:32.7911087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7911378Z 2025-05-07T20:32:32.7911583Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7911873Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7912180Z 2025-05-07T20:32:32.7912429Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7912766Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7913073Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7913403Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7913765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7914089Z 2025-05-07T20:32:32.7914305Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.7914503Z 2025-05-07T20:32:32.7914618Z moe/activation_test.py:126: 2025-05-07T20:32:32.7914917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7915264Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7915599Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7916572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7917338Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7917893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7918663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7919355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7920084Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7920823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7921469Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7922074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7922612Z fn() 2025-05-07T20:32:32.7923131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7923716Z self.fn.run( 2025-05-07T20:32:32.7924201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7924746Z kernel = self.compile( 2025-05-07T20:32:32.7925286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7925951Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7926370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7926602Z 2025-05-07T20:32:32.7926820Z self = 2025-05-07T20:32:32.7927913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7929301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c0535ac00>} 2025-05-07T20:32:32.7930663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7931694Z context = 2025-05-07T20:32:32.7931988Z 2025-05-07T20:32:32.7932165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7932691Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7933197Z module_map=module_map) 2025-05-07T20:32:32.7933603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7933969Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7934252Z E ^ 2025-05-07T20:32:32.7934729Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7935185Z 2025-05-07T20:32:32.7935610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7936123Z 2025-05-07T20:32:32.7936229Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7936647Z self=, 2025-05-07T20:32:32.7937059Z T=16384, 2025-05-07T20:32:32.7937254Z D=7168, 2025-05-07T20:32:32.7937460Z scale_ub=1200.0, 2025-05-07T20:32:32.7937690Z contiguous=False, 2025-05-07T20:32:32.7937927Z compiled=False, 2025-05-07T20:32:32.7938137Z ) 2025-05-07T20:32:32.9755599Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:32.9757827Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:32.9760748Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:32.9763295Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:32.9764272Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9765836Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:32.9767240Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.9768232Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9769468Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:32.9770854Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.9771926Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9773220Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:32.9774473Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:32.9775709Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:32.9776914Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:32.9777752Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.9778785Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:32.9779808Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:32.9780600Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:32.9781968Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:32.9783258Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:32.9784489Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:32.9785538Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:32.9786717Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:32.9788091Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:32.9789162Z 
W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.9790091Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.9790840Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:32.9791866Z W0507 20:32:32.972000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:33.9737084Z self = ...
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:33.9767245Z Trying example: test_silu_mul_quant(self=..., T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
self = ...
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [... test source identical to the listing above up to fn(); this example continues past it:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:33.9814330Z Trying example: test_silu_mul_quant(self=..., T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... four identify_mutated_tensors warning tracebacks (W0507 20:32:34.267/.476/.771/.781, tag [1/3]) omitted; frames and the fp8e4nv ValueError are identical to the [1/2] traceback above ...]
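[editor's note] The W0507 identify_mutated_tensors warnings collapsed above are a downstream symptom, not a separate bug: when torch.compile encounters a user-defined Triton kernel it first generates TTIR to work out which arguments the kernel writes to, and when that generation raises (here, the same fp8e4nv ValueError), it conservatively assumes every input is mutated, logs the traceback, and carries on. The test failures themselves come from the kernels' own JIT compilation at launch time.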
[... the remaining Hypothesis examples repeat the same source listing and traceback verbatim; condensed to one line each ...]
2025-05-07T20:32:36.1366330Z [T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
2025-05-07T20:32:36.1395850Z Trying example: test_silu_mul_quant(self=..., T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:36.1398047Z [fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
2025-05-07T20:32:36.1426966Z Trying example: test_silu_mul_quant(self=..., T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:36.2006129Z [fn() passes; fails at ref_fn() (moe/activation_test.py:126); _kernel_quantize_fp8_row: same CompilationError]
2025-05-07T20:32:36.2053890Z Trying example: test_silu_mul_quant(self=..., T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:36.4027582Z [fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
2025-05-07T20:32:36.4060590Z Trying example: test_silu_mul_quant(self=..., T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:36.4062959Z [fails at fn() (moe/activation_test.py:117); _fbgemm_silu_mul_quant: same CompilationError]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4091373Z 2025-05-07T20:32:36.4091788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4092299Z 2025-05-07T20:32:36.4092410Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4092820Z self=, 2025-05-07T20:32:36.4093227Z T=1, 2025-05-07T20:32:36.4093425Z D=5120, 2025-05-07T20:32:36.4093623Z scale_ub=None, 2025-05-07T20:32:36.4093834Z contiguous=True, 2025-05-07T20:32:36.4094059Z compiled=True, 2025-05-07T20:32:36.4094268Z ) 2025-05-07T20:32:36.6439332Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.6440425Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.6441777Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.6443241Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.6444234Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6445553Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.6446945Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6447943Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6449482Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.6450868Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6452148Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6453430Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.6454687Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.6455961Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.6457181Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.6458014Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6459040Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.6460064Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.6460871Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.6462086Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.6463372Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.6464494Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.6465866Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.6467066Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.6468432Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.6469656Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6470575Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6471410Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.6472692Z W0507 20:32:36.640000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7138663Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.7141269Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.7143967Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.7145871Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.7146869Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7148186Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.7149577Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7150577Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7151816Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.7153208Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7154294Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7155628Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.7156961Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.7158196Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.7159414Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.7160257Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.7161280Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.7162310Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.7163114Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.7164474Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.7166118Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.7167240Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.7168289Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.7169486Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.7170854Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.7171922Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7172841Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7173586Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.7174621Z W0507 20:32:36.710000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9172800Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.9173863Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.9175234Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.9177050Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.9178248Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9179870Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.9181584Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9182787Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9184306Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.9186369Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9187442Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9188862Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.9190121Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.9191339Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.9192553Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.9193378Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9194417Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.9195444Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.9196330Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.9197536Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.9198812Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.9199942Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.9200981Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.9202165Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.9203521Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.9204583Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9205556Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9206297Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.9207316Z W0507 20:32:36.914000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.9277953Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.9279179Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.9280525Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.9282068Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.9283059Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9284374Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.9285768Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.9286769Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9288007Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.9289389Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.9290469Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9291758Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.9293013Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.9294240Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.9295460Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.9296293Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.9297324Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:36.9298360Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:36.9299163Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:36.9300381Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.9301741Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.9302872Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:36.9304000Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.9305184Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.9306554Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.9307628Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.9308548Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.9309304Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.9310337Z W0507 20:32:36.924000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1482816Z self = 2025-05-07T20:32:37.1483554Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.1483833Z 2025-05-07T20:32:37.1483919Z @given( 2025-05-07T20:32:37.1484167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1484521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1484833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1485172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1485523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1485813Z ) 2025-05-07T20:32:37.1486177Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1486629Z def test_silu_mul_quant( 2025-05-07T20:32:37.1486872Z self, 2025-05-07T20:32:37.1487078Z T: int, 2025-05-07T20:32:37.1487287Z D: int, 2025-05-07T20:32:37.1487512Z scale_ub: Optional[float], 2025-05-07T20:32:37.1487819Z contiguous: bool, 2025-05-07T20:32:37.1488072Z compiled: bool, 2025-05-07T20:32:37.1488307Z ) -> None: 2025-05-07T20:32:37.1488528Z torch.manual_seed(2025) 2025-05-07T20:32:37.1488782Z 2025-05-07T20:32:37.1489068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1489430Z 2025-05-07T20:32:37.1489630Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1489934Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1490256Z x = x_sign * x_clamp 2025-05-07T20:32:37.1490511Z x0 = x[:, :D] 2025-05-07T20:32:37.1490740Z x1 = x[:, D:] 2025-05-07T20:32:37.1490961Z 2025-05-07T20:32:37.1491152Z if contiguous: 2025-05-07T20:32:37.1491390Z x0 = x0.contiguous() 2025-05-07T20:32:37.1499870Z x1 = x1.contiguous() 2025-05-07T20:32:37.1500148Z 2025-05-07T20:32:37.1500346Z if scale_ub is not None: 2025-05-07T20:32:37.1500623Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1500957Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1501299Z ) 2025-05-07T20:32:37.1501495Z else: 2025-05-07T20:32:37.1501710Z scale_ub_tensor = None 2025-05-07T20:32:37.1501976Z 2025-05-07T20:32:37.1502557Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1502875Z op = silu_mul_quant 2025-05-07T20:32:37.1503118Z if compiled: 2025-05-07T20:32:37.1503362Z op = torch.compile(op) 2025-05-07T20:32:37.1503812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1504088Z 2025-05-07T20:32:37.1504295Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.1504585Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.1504887Z 2025-05-07T20:32:37.1505133Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1505478Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.1505782Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.1506110Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.1506479Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.1506794Z 2025-05-07T20:32:37.1507013Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.1507211Z 2025-05-07T20:32:37.1507325Z moe/activation_test.py:126: 2025-05-07T20:32:37.1507626Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1507984Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.1508322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.1509113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:32:37.1509874Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.1510426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1511118Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1511812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.1512552Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.1513300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.1513955Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.1514561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.1515090Z fn() 2025-05-07T20:32:37.1515608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.1516314Z self.fn.run( 2025-05-07T20:32:37.1516788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1517332Z kernel = self.compile( 2025-05-07T20:32:37.1517881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1518532Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1518942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1519178Z 2025-05-07T20:32:37.1519394Z self = 2025-05-07T20:32:37.1520485Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1521879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05058c20>} 2025-05-07T20:32:37.1523321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1524360Z context = 2025-05-07T20:32:37.1524651Z 2025-05-07T20:32:37.1524905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1525427Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1525908Z module_map=module_map) 2025-05-07T20:32:37.1526277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1526642Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.1526911Z E ^ 2025-05-07T20:32:37.1527382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1527835Z 2025-05-07T20:32:37.1528259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1528779Z 2025-05-07T20:32:37.1528891Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1529313Z self=, 2025-05-07T20:32:37.1529734Z T=2048, 2025-05-07T20:32:37.1529932Z D=5120, 2025-05-07T20:32:37.1530132Z scale_ub=None, 2025-05-07T20:32:37.1530360Z contiguous=True, 2025-05-07T20:32:37.1530593Z compiled=True, 2025-05-07T20:32:37.1530802Z ) 2025-05-07T20:32:37.3786614Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.3787885Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.3789269Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.3790723Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.3791721Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3793042Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.3794439Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.3795444Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3796797Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.3798187Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.3799271Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3800904Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.3802173Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.3803553Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.3804768Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.3805615Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.3806665Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.3807694Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.3808512Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.3809727Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.3811024Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.3812169Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.3813224Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.3814412Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.3815848Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.3816927Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.3817854Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.3818614Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.3819641Z W0507 20:32:37.375000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.4480335Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.4481650Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.4482997Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.4484816Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.4485856Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4487347Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.4488739Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4489746Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4490989Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.4492385Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4493459Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4494754Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.4496012Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.4497243Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.4498473Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.4499311Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4500346Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.4501382Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.4502188Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.4503405Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.4504698Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.4505828Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.4506965Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.4508164Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.4509605Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.4510688Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4511613Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4512371Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.4513401Z W0507 20:32:37.445000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.6514484Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.6516426Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.6517767Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.6519213Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.6520204Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6521513Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.6522901Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.6523892Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6525129Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.6526507Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.6527574Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6528858Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.6530113Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.6531658Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.6533019Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.6533842Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6534871Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.6535945Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.6536752Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.6537965Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.6539247Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.6540368Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.6541418Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.6542607Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.6543970Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.6545034Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.6545996Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.6546744Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.6547770Z W0507 20:32:37.648000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.6614366Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.6615612Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:37.6616992Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.6618412Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.6619577Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6620884Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.6622366Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.6623355Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6624580Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.6626005Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.6627079Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6628352Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.6629595Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:37.6630822Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.6632032Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:37.6632865Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.6633882Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:37.6634906Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:37.6635841Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:37.6637064Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.6638343Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.6639459Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:37.6640500Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:37.6641676Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.6643149Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.6644278Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.6645186Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.6645924Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:37.6646942Z W0507 20:32:37.658000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

self =
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea1553a0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.8742256Z 2025-05-07T20:32:37.8742675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.8743262Z 2025-05-07T20:32:37.8743372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.8743784Z self=, 2025-05-07T20:32:37.8744189Z T=128, 2025-05-07T20:32:37.8744382Z D=5120, 2025-05-07T20:32:37.8744575Z scale_ub=None, 2025-05-07T20:32:37.8744800Z contiguous=True, 2025-05-07T20:32:37.8745028Z compiled=True, 2025-05-07T20:32:37.8745235Z ) 2025-05-07T20:32:38.1087121Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:38.1088237Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:38.1089586Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:38.1091028Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:38.1092010Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1093328Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:38.1094704Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.1095696Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1096983Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:38.1098359Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.1099427Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1100708Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:38.1101958Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:38.1103173Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:38.1104373Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:38.1105527Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.1106550Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:38.1107775Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:38.1108567Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:32:38.1109772Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:38.1111061Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:38.1112167Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:38.1113215Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:38.1114392Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:38.1115842Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:38.1116909Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.1117825Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.1118582Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:38.1119612Z W0507 20:32:38.105000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be93a1ee0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6455082Z 2025-05-07T20:32:38.6455504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.6456016Z 2025-05-07T20:32:38.6456130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.6456548Z self=, 2025-05-07T20:32:38.6456959Z T=4096, 2025-05-07T20:32:38.6457162Z D=5120, 2025-05-07T20:32:38.6457364Z scale_ub=None, 2025-05-07T20:32:38.6457590Z contiguous=True, 2025-05-07T20:32:38.6457820Z compiled=True, 2025-05-07T20:32:38.6458032Z ) 2025-05-07T20:32:38.8841195Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:38.8842436Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:38.8843786Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:38.8845229Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:38.8846216Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8847529Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:38.8848922Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.8849911Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8851142Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:38.8852527Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.8853617Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8854907Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:38.8856171Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:38.8857402Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:38.8858620Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:38.8859456Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:38.8860488Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:38.8861513Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:38.8862318Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:38.8863615Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:38.8864910Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:38.8866375Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:38.8867429Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:38.8868621Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:38.8869987Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:38.8871066Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.8871994Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.8872740Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:38.8873765Z W0507 20:32:38.881000 95431 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.4257343Z self = 2025-05-07T20:32:39.4258114Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:39.4258456Z 2025-05-07T20:32:39.4258543Z @given( 2025-05-07T20:32:39.4258782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.4259113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.4259429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.4259765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.4260457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.4260748Z ) 2025-05-07T20:32:39.4261118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.4261748Z def test_silu_mul_quant( 2025-05-07T20:32:39.4262001Z self, 2025-05-07T20:32:39.4262195Z T: int, 2025-05-07T20:32:39.4262395Z D: int, 2025-05-07T20:32:39.4262622Z scale_ub: Optional[float], 2025-05-07T20:32:39.4262898Z contiguous: bool, 2025-05-07T20:32:39.4263144Z compiled: bool, 2025-05-07T20:32:39.4263375Z ) -> None: 2025-05-07T20:32:39.4263593Z torch.manual_seed(2025) 2025-05-07T20:32:39.4263849Z 2025-05-07T20:32:39.4264130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.4264478Z 2025-05-07T20:32:39.4264681Z x_sign = torch.sign(x) 2025-05-07T20:32:39.4264984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.4265300Z x = x_sign * x_clamp 2025-05-07T20:32:39.4265921Z x0 = x[:, :D] 2025-05-07T20:32:39.4266144Z x1 = x[:, D:] 2025-05-07T20:32:39.4266352Z 2025-05-07T20:32:39.4266552Z if contiguous: 2025-05-07T20:32:39.4266784Z x0 = x0.contiguous() 2025-05-07T20:32:39.4267044Z x1 = x1.contiguous() 2025-05-07T20:32:39.4267286Z 2025-05-07T20:32:39.4267483Z if scale_ub is not None: 2025-05-07T20:32:39.4267762Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.4268099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.4268415Z ) 2025-05-07T20:32:39.4268608Z else: 2025-05-07T20:32:39.4268828Z scale_ub_tensor = None 2025-05-07T20:32:39.4269092Z 2025-05-07T20:32:39.4269333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.4269646Z op = silu_mul_quant 2025-05-07T20:32:39.4269909Z if compiled: 2025-05-07T20:32:39.4270165Z op = torch.compile(op) 2025-05-07T20:32:39.4270458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.4270738Z 2025-05-07T20:32:39.4270937Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.4271225Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.4271520Z 2025-05-07T20:32:39.4271766Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.4272099Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.4272399Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.4272719Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.4273078Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.4273398Z 2025-05-07T20:32:39.4273604Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.4273799Z 2025-05-07T20:32:39.4273910Z moe/activation_test.py:126: 2025-05-07T20:32:39.4274213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.4274561Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.4274898Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.4275771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:39.4276534Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.4277088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.4277776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.4278467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.4279200Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.4280075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.4280723Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.4281326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.4281956Z fn() 2025-05-07T20:32:39.4282465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.4283043Z self.fn.run( 2025-05-07T20:32:39.4283515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.4284049Z kernel = self.compile( 2025-05-07T20:32:39.4284591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.4285239Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.4285647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.4285878Z 2025-05-07T20:32:39.4286093Z self = 2025-05-07T20:32:39.4287181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.4288569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be9722660>} 2025-05-07T20:32:39.4289911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.4290938Z context = 2025-05-07T20:32:39.4291232Z 2025-05-07T20:32:39.4291407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.4291933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.4292422Z module_map=module_map) 2025-05-07T20:32:39.4292791Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.4293156Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.4293425Z E ^ 2025-05-07T20:32:39.4293891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.4294772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:39.4295388Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:32:39.4590668Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:39.4591917Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:39.4593270Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:39.4594583Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:39.4595791Z W0507 20:32:39.457000 95431 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
[The full test_silu_mul_quant source listing and the identical fp8e4nv CompilationError traceback repeat verbatim for every hypothesis example below; the repeats are condensed to one summary line per example, keeping only what varies: the drawn parameters and the first kernel to fail.]
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True: fails in ref_fn() (moe/activation_test.py:126) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row: CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.5513884Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, )
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True: fails in fn() (moe/activation_test.py:117) -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError
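The ValueError itself points at the root cause: Triton only exposes fp8e4nv (torch.float8_e4m3fn) on GPUs with native FP8 tensor cores, i.e. compute capability sm_89 (Ada) or sm_90 (Hopper) and newer. This job runs on linux.g5.4xlarge.nvidia.gpu, whose A10G is sm_86, so only the fp8e4b15/fp8e5 encodings are available and every kernel that materializes an e4m3 output fails to compile. A minimal sketch of a capability gate that a test module like this could use to skip cleanly on pre-Ada runners (the helper name and the skip placement are illustrative assumptions, not FBGEMM's actual gating):

    import unittest

    import torch

    def has_native_fp8() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn, which needs sm_89+ hardware.
        # The A10G on g5.4xlarge reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard; FBGEMM may gate its FP8 coverage differently.
    @unittest.skipUnless(has_native_fp8(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        ...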
2025-05-07T20:32:39.6949411Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, )
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True: fails in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError
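The recompile_limit warning above is a separate symptom worth noting: each hypothesis draw hands silu_mul_quant inputs with different strides (a slice of the [T, 2*D] buffer keeps row stride 2*D = 10240, while .contiguous() copies it down to D = 5120, exactly the "expected 5120, actual 10240" in the guard message), and torch.compile specializes on strides until it hits the limit of 8 and falls back to eager. A small sketch of the effect, using a dense stand-in for the real kernel so it runs anywhere:

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for silu_mul_quant's dense part; illustration only.
        return x0 * torch.sigmoid(x0) * x1

    compiled = torch.compile(silu_mul)

    D = 5120
    x = torch.randn(128, 2 * D)
    x0_view, x1_view = x[:, :D], x[:, D:]                      # row stride 10240
    x0_c, x1_c = x0_view.contiguous(), x1_view.contiguous()    # row stride 5120

    compiled(x0_view, x1_view)  # first graph, guards on stride 10240
    compiled(x0_c, x1_c)        # stride guard fails -> recompile
    # Run with TORCH_LOGS="recompiles" to print each recompilation reason.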
2025-05-07T20:32:39.7602930Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, )
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False: fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same CompilationError
2025-05-07T20:32:39.9142302Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True, )
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True: fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same CompilationError
2025-05-07T20:32:39.9183549Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False, )
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False: fails in fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant: same CompilationError
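hypothesis keeps drawing further examples after the first failure, which is why the same traceback is replayed for every parameter combination above. During triage it is usually faster to pin one failing draw deterministically; a sketch using hypothesis's @example decorator, with the parameters taken from one of the failing draws above:

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st

    @example(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(max_examples=1, deadline=None)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # body as in the listing above

Explicit @example cases run before any generated ones, so the pinned combination reproduces first on every run.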
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.0342362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.0343054Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.0343729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.0344272Z kernel = self.compile( 2025-05-07T20:32:40.0344824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.0345487Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.0345888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.0346126Z 2025-05-07T20:32:40.0346336Z self = 2025-05-07T20:32:40.0347474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.0348978Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8660e00>} 2025-05-07T20:32:40.0350320Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.0351428Z context = 2025-05-07T20:32:40.0351722Z 2025-05-07T20:32:40.0351890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.0352416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.0352888Z module_map=module_map) 2025-05-07T20:32:40.0353256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.0353617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.0353884Z E ^ 2025-05-07T20:32:40.0354352Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.0354813Z 2025-05-07T20:32:40.0355227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.0355837Z 2025-05-07T20:32:40.0355952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.0356372Z self=, 2025-05-07T20:32:40.0356783Z T=128, 2025-05-07T20:32:40.0356984Z D=5120, 2025-05-07T20:32:40.0357191Z scale_ub=None, 2025-05-07T20:32:40.0357412Z contiguous=False, 2025-05-07T20:32:40.0357646Z compiled=False, 2025-05-07T20:32:40.0357863Z ) 2025-05-07T20:32:40.0358186Z self = 2025-05-07T20:32:40.0358687Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.0358959Z 2025-05-07T20:32:40.0359050Z @given( 2025-05-07T20:32:40.0359293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.0359618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.0359935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.0360281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.0360616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.0360911Z ) 2025-05-07T20:32:40.0361268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.0361715Z def test_silu_mul_quant( 2025-05-07T20:32:40.0361969Z self, 2025-05-07T20:32:40.0362174Z T: int, 2025-05-07T20:32:40.0362376Z D: int, 2025-05-07T20:32:40.0362606Z scale_ub: Optional[float], 2025-05-07T20:32:40.0362885Z contiguous: bool, 2025-05-07T20:32:40.0363128Z compiled: bool, 2025-05-07T20:32:40.0363359Z ) -> None: 2025-05-07T20:32:40.0363583Z torch.manual_seed(2025) 2025-05-07T20:32:40.0363825Z 2025-05-07T20:32:40.0364118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.0364476Z 2025-05-07T20:32:40.0364676Z x_sign = torch.sign(x) 2025-05-07T20:32:40.0364983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.0365310Z x = x_sign * x_clamp 2025-05-07T20:32:40.0369588Z x0 = x[:, :D] 2025-05-07T20:32:40.0369816Z x1 = x[:, D:] 2025-05-07T20:32:40.0370035Z 2025-05-07T20:32:40.0370226Z if contiguous: 2025-05-07T20:32:40.0370464Z x0 = x0.contiguous() 2025-05-07T20:32:40.0370726Z x1 = x1.contiguous() 2025-05-07T20:32:40.0370970Z 2025-05-07T20:32:40.0371164Z if scale_ub is not None: 2025-05-07T20:32:40.0371446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.0371787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.0372099Z ) 2025-05-07T20:32:40.0372297Z else: 2025-05-07T20:32:40.0372679Z scale_ub_tensor = None 2025-05-07T20:32:40.0372936Z 2025-05-07T20:32:40.0373175Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.0373497Z op = silu_mul_quant 2025-05-07T20:32:40.0373860Z if compiled: 2025-05-07T20:32:40.0374113Z op = torch.compile(op) 2025-05-07T20:32:40.0374413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.0374694Z 2025-05-07T20:32:40.0374890Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.0375062Z 2025-05-07T20:32:40.0375166Z moe/activation_test.py:117: 2025-05-07T20:32:40.0375466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.0375798Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.0376083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.0376777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.0377518Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.0378059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.0378741Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.0379413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.0379946Z kernel = self.compile( 2025-05-07T20:32:40.0380496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.0381159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.0381568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.0381801Z 2025-05-07T20:32:40.0382009Z self = 2025-05-07T20:32:40.0383098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.0384480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8662980>} 2025-05-07T20:32:40.0385829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.0386862Z context = 2025-05-07T20:32:40.0387166Z 2025-05-07T20:32:40.0387335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.0387866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.0388354Z module_map=module_map) 2025-05-07T20:32:40.0388722Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.0389091Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.0389374Z E ^ 2025-05-07T20:32:40.0389840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.0390304Z 2025-05-07T20:32:40.0390721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.0391238Z 2025-05-07T20:32:40.0391347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.0391779Z self=, 2025-05-07T20:32:40.0392184Z T=128, 2025-05-07T20:32:40.0392387Z D=5120, 2025-05-07T20:32:40.0392592Z scale_ub=1200.0, 2025-05-07T20:32:40.0392819Z contiguous=True, 2025-05-07T20:32:40.0393051Z compiled=False, 2025-05-07T20:32:40.0393774Z ) 2025-05-07T20:32:40.4218367Z self = 2025-05-07T20:32:40.4219002Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:40.4219668Z 2025-05-07T20:32:40.4219752Z @given( 2025-05-07T20:32:40.4219999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4220317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4220637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4220977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4221301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4221592Z ) 2025-05-07T20:32:40.4221951Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4222398Z def test_silu_mul_quant( 2025-05-07T20:32:40.4222662Z self, 2025-05-07T20:32:40.4222874Z T: int, 2025-05-07T20:32:40.4223090Z D: int, 2025-05-07T20:32:40.4223318Z scale_ub: Optional[float], 2025-05-07T20:32:40.4223597Z contiguous: bool, 2025-05-07T20:32:40.4223847Z compiled: bool, 2025-05-07T20:32:40.4224079Z ) -> None: 2025-05-07T20:32:40.4224305Z torch.manual_seed(2025) 2025-05-07T20:32:40.4224560Z 2025-05-07T20:32:40.4224837Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4225193Z 2025-05-07T20:32:40.4225394Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4225685Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4226009Z x = x_sign * x_clamp 2025-05-07T20:32:40.4226259Z x0 = x[:, :D] 2025-05-07T20:32:40.4226477Z x1 = x[:, D:] 2025-05-07T20:32:40.4226693Z 2025-05-07T20:32:40.4226890Z if contiguous: 2025-05-07T20:32:40.4227126Z x0 = x0.contiguous() 2025-05-07T20:32:40.4227392Z x1 = x1.contiguous() 2025-05-07T20:32:40.4227642Z 2025-05-07T20:32:40.4227843Z if scale_ub is not None: 2025-05-07T20:32:40.4228121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4228468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4228794Z ) 2025-05-07T20:32:40.4228987Z else: 2025-05-07T20:32:40.4229204Z scale_ub_tensor = None 2025-05-07T20:32:40.4229461Z 2025-05-07T20:32:40.4229692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4230011Z op = silu_mul_quant 2025-05-07T20:32:40.4230265Z if compiled: 2025-05-07T20:32:40.4230509Z op = torch.compile(op) 2025-05-07T20:32:40.4230806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4231085Z 2025-05-07T20:32:40.4231277Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4231447Z 2025-05-07T20:32:40.4231550Z moe/activation_test.py:117: 2025-05-07T20:32:40.4231852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4232199Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4232488Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4233188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.4233885Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4234418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4235105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4235878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4236412Z kernel = self.compile( 2025-05-07T20:32:40.4236954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4237774Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4238180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4238411Z 2025-05-07T20:32:40.4238618Z self = 2025-05-07T20:32:40.4239780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4241271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be86602c0>} 2025-05-07T20:32:40.4242620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4243651Z context = 2025-05-07T20:32:40.4243940Z 2025-05-07T20:32:40.4244106Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4244639Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4245114Z module_map=module_map) 2025-05-07T20:32:40.4245482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4245837Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4246101Z E ^ 2025-05-07T20:32:40.4246569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.4247018Z 2025-05-07T20:32:40.4247439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4247948Z 2025-05-07T20:32:40.4248053Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4248471Z self=, 2025-05-07T20:32:40.4248882Z T=1, 2025-05-07T20:32:40.4249063Z D=7168, 2025-05-07T20:32:40.4249271Z scale_ub=1200.0, 2025-05-07T20:32:40.4249499Z contiguous=True, 2025-05-07T20:32:40.4249719Z compiled=True, 2025-05-07T20:32:40.4249931Z ) 2025-05-07T20:32:40.4250258Z self = 2025-05-07T20:32:40.4250741Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.4251009Z 2025-05-07T20:32:40.4251089Z @given( 2025-05-07T20:32:40.4251332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4251651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4251957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4252302Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4252645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4252929Z ) 2025-05-07T20:32:40.4253281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4253730Z def test_silu_mul_quant( 2025-05-07T20:32:40.4253978Z self, 2025-05-07T20:32:40.4254184Z T: int, 2025-05-07T20:32:40.4254391Z D: int, 2025-05-07T20:32:40.4254608Z scale_ub: Optional[float], 2025-05-07T20:32:40.4254889Z contiguous: bool, 2025-05-07T20:32:40.4255142Z compiled: bool, 2025-05-07T20:32:40.4255371Z ) -> None: 2025-05-07T20:32:40.4255589Z torch.manual_seed(2025) 2025-05-07T20:32:40.4255836Z 2025-05-07T20:32:40.4256118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4256462Z 2025-05-07T20:32:40.4256657Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4256953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4257261Z x = x_sign * x_clamp 2025-05-07T20:32:40.4257597Z x0 = x[:, :D] 2025-05-07T20:32:40.4257823Z x1 = x[:, D:] 2025-05-07T20:32:40.4258029Z 2025-05-07T20:32:40.4258219Z if contiguous: 2025-05-07T20:32:40.4258455Z x0 = x0.contiguous() 2025-05-07T20:32:40.4258787Z x1 = x1.contiguous() 2025-05-07T20:32:40.4259033Z 2025-05-07T20:32:40.4259230Z if scale_ub is not None: 2025-05-07T20:32:40.4259498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4259840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4260156Z ) 2025-05-07T20:32:40.4260347Z else: 2025-05-07T20:32:40.4260568Z scale_ub_tensor = None 2025-05-07T20:32:40.4260821Z 2025-05-07T20:32:40.4269250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4269687Z op = silu_mul_quant 2025-05-07T20:32:40.4269950Z if compiled: 2025-05-07T20:32:40.4270196Z op = torch.compile(op) 2025-05-07T20:32:40.4270502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4270787Z 2025-05-07T20:32:40.4270983Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4271158Z 2025-05-07T20:32:40.4271262Z moe/activation_test.py:117: 2025-05-07T20:32:40.4271578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4271918Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4272200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4272973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.4273540Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.4274198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.4274891Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4275444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4276224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4276889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4277491Z kernel = self.compile( 2025-05-07T20:32:40.4278040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4278710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4279113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4279356Z 2025-05-07T20:32:40.4279567Z self = 2025-05-07T20:32:40.4280661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4282163Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8662ca0>} 2025-05-07T20:32:40.4283587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4284615Z context = 2025-05-07T20:32:40.4284910Z 2025-05-07T20:32:40.4285078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4285603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4286077Z module_map=module_map) 2025-05-07T20:32:40.4286649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4287015Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4287274Z E ^ 2025-05-07T20:32:40.4287741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.4288740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.4289362Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.4289770Z self=,
2025-05-07T20:32:40.4290178Z T=1,
2025-05-07T20:32:40.4290374Z D=7168,
2025-05-07T20:32:40.4290572Z scale_ub=1200.0,
2025-05-07T20:32:40.4290800Z contiguous=False,
2025-05-07T20:32:40.4291035Z compiled=True,
2025-05-07T20:32:40.4291238Z )
2025-05-07T20:32:40.5708472Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.5708831Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.5709100Z E ^
2025-05-07T20:32:40.5709577Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.5710447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
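Every failure above is the same compile-time rejection: Triton refuses to build IR for the fp8e4nv (FP8 E4M3) type because this GPU only offers fp8e4b15 and fp8e5, the signature of a pre-sm_89 NVIDIA part. Native FP8 E4M3 lowering requires compute capability 8.9 or newer (Ada/Hopper), so the kernel dies in make_ir before a single example can run. A minimal sketch of a capability gate that would skip these tests on such hardware follows; supports_fp8e4nv and ActivationTests are illustrative names, not FBGEMM or Triton APIs.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv lowers to native FP8 instructions, introduced at
        # compute capability 8.9 (sm_89). Ampere-class GPUs report (8, 6)
        # and raise the ValueError seen throughout this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 requires an sm_89+ GPU")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would live here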
2025-05-07T20:32:40.5711078Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.5711499Z self=,
2025-05-07T20:32:40.5711919Z T=1,
2025-05-07T20:32:40.5712112Z D=7168,
2025-05-07T20:32:40.5712318Z scale_ub=None,
2025-05-07T20:32:40.5712547Z contiguous=False,
2025-05-07T20:32:40.5712776Z compiled=True,
2025-05-07T20:32:40.5712999Z )
2025-05-07T20:32:40.6610309Z self = 
2025-05-07T20:32:40.6610857Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:40.6623027Z y_fp8, y_scale = fn()
2025-05-07T20:32:40.6623321Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:40.6623851Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:40.6624189Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:40.6624484Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:40.6624792Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:40.6625153Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:40.6625669Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:40.6625867Z 2025-05-07T20:32:40.6625969Z moe/activation_test.py:126: 2025-05-07T20:32:40.6626273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6626609Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.6626936Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6627717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.6628469Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.6629009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6629688Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6630374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.6631090Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6631909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.6632546Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.6633149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.6633765Z fn() 2025-05-07T20:32:40.6634273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.6634995Z self.fn.run( 2025-05-07T20:32:40.6635470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6636163Z kernel = self.compile( 2025-05-07T20:32:40.6636704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6637353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6637752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6637991Z 2025-05-07T20:32:40.6638198Z self = 2025-05-07T20:32:40.6639278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6640666Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be858e8e0>} 2025-05-07T20:32:40.6642002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6643019Z context = 2025-05-07T20:32:40.6643314Z 2025-05-07T20:32:40.6643485Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6644007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6644483Z module_map=module_map) 2025-05-07T20:32:40.6644839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6645197Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.6645469Z E ^ 2025-05-07T20:32:40.6645928Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.6646800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.6647423Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.6647845Z self=,
2025-05-07T20:32:40.6648248Z T=1,
2025-05-07T20:32:40.6648441Z D=5120,
2025-05-07T20:32:40.6648646Z scale_ub=1200.0,
2025-05-07T20:32:40.6648867Z contiguous=False,
2025-05-07T20:32:40.6649102Z compiled=True,
2025-05-07T20:32:40.6649310Z )
2025-05-07T20:32:40.8222476Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.8222843Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.8223115Z E ^
2025-05-07T20:32:40.8223585Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.8224469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.8225096Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.8225519Z self=,
2025-05-07T20:32:40.8225925Z T=1,
2025-05-07T20:32:40.8226119Z D=5120,
2025-05-07T20:32:40.8226325Z scale_ub=1200.0,
2025-05-07T20:32:40.8226549Z contiguous=False,
2025-05-07T20:32:40.8226780Z compiled=False,
2025-05-07T20:32:40.8226992Z )
2025-05-07T20:32:40.8261721Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.8262089Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.8262349Z E ^
2025-05-07T20:32:40.8262818Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.8263702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
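Note that the T=1, D=7168, scale_ub=None example above got past fn() and instead died compiling _kernel_quantize_fp8_row inside the reference path, so the test's own oracle, triton_quantize_fp8_row, depends on fp8e4nv as well. For triage, the row-wise quantization can be approximated in eager PyTorch, since casting to torch.float8_e4m3fn is a software conversion that works without native FP8 hardware. This is a hedged sketch, assuming scale_ub caps the per-row max and that dequantization is y_fp8.float() * scale[:, None] as in the test; it is not FBGEMM's exact kernel logic.

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, clamped away from zero so the divide is safe.
        row_max = y.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        # Clamp before the cast: values can exceed FP8_MAX when scale_ub
        # lowered the scale below the true row max.
        y_scaled = (y.float() / scale).clamp(-FP8_MAX, FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale.squeeze(-1)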
2025-05-07T20:32:40.8264331Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.8264743Z self=,
2025-05-07T20:32:40.8265159Z T=16384,
2025-05-07T20:32:40.8265627Z D=5120,
2025-05-07T20:32:40.8265842Z scale_ub=1200.0,
2025-05-07T20:32:40.8266294Z contiguous=False,
2025-05-07T20:32:40.8266543Z compiled=True,
2025-05-07T20:32:40.8266751Z )
2025-05-07T20:32:40.9171698Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.9172058Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.9172316Z E ^
2025-05-07T20:32:40.9172791Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.9173663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.9174293Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.9174702Z self=,
2025-05-07T20:32:40.9175109Z T=2048,
2025-05-07T20:32:40.9175306Z D=7168,
2025-05-07T20:32:40.9175505Z scale_ub=1200.0,
2025-05-07T20:32:40.9175728Z contiguous=False,
2025-05-07T20:32:40.9175960Z compiled=True,
2025-05-07T20:32:40.9176168Z )
2025-05-07T20:32:40.9204147Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.9204510Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:40.9204780Z E ^
2025-05-07T20:32:40.9205240Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.9206114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
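The remaining examples repeat the same frames (moe/activation_test.py:115 into activation.py:80 into the Triton compile), so the root cause can be reproduced without FBGEMM at all. Below is a distilled sketch, assuming a recent Triton where tl.float8e4nv and torch.float8_e4m3fn interoperate; on a pre-sm_89 GPU the compile of this kernel should raise the same ValueError before anything executes. The kernel and its name are ours, not FBGEMM's _fbgemm_silu_mul_quant.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(X, Y, N, BLOCK: tl.constexpr):
        # Load a block of fp32 values and store them back as fp8e4nv; the
        # cast forces Triton to materialize the unsupported type in codegen.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(X + offs, mask=mask)
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # Expected on sm_86: triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    _cast_to_fp8e4nv[(1024 // 256,)](x, y, 1024, BLOCK=256)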
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9205699Z 2025-05-07T20:32:40.9206114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.9206634Z 2025-05-07T20:32:41.0371303Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0371947Z self=, 2025-05-07T20:32:41.0372516Z T=1, 2025-05-07T20:32:41.0372780Z D=5120, 2025-05-07T20:32:41.0372977Z scale_ub=None, 2025-05-07T20:32:41.0373209Z contiguous=False, 2025-05-07T20:32:41.0373448Z compiled=False, 2025-05-07T20:32:41.0373659Z ) 2025-05-07T20:32:41.0374005Z self = 2025-05-07T20:32:41.0374513Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:41.0374786Z 2025-05-07T20:32:41.0374867Z @given( 2025-05-07T20:32:41.0375111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0375436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0375745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0376085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0376424Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0376720Z ) 2025-05-07T20:32:41.0377072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0377567Z def test_silu_mul_quant( 2025-05-07T20:32:41.0377834Z self, 2025-05-07T20:32:41.0378035Z T: int, 2025-05-07T20:32:41.0378245Z D: int, 2025-05-07T20:32:41.0378476Z scale_ub: Optional[float], 2025-05-07T20:32:41.0378748Z contiguous: bool, 2025-05-07T20:32:41.0378998Z compiled: bool, 2025-05-07T20:32:41.0379234Z ) -> None: 2025-05-07T20:32:41.0379455Z torch.manual_seed(2025) 2025-05-07T20:32:41.0379708Z 2025-05-07T20:32:41.0379989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0380335Z 2025-05-07T20:32:41.0380535Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0380833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0381142Z x = x_sign * x_clamp 2025-05-07T20:32:41.0381391Z x0 = x[:, :D] 2025-05-07T20:32:41.0381617Z x1 = x[:, D:] 2025-05-07T20:32:41.0381827Z 2025-05-07T20:32:41.0382018Z if contiguous: 2025-05-07T20:32:41.0382265Z x0 = x0.contiguous() 2025-05-07T20:32:41.0382534Z x1 = x1.contiguous() 2025-05-07T20:32:41.0382779Z 2025-05-07T20:32:41.0382981Z if scale_ub is not None: 2025-05-07T20:32:41.0383273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0383611Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0383931Z ) 2025-05-07T20:32:41.0384134Z else: 2025-05-07T20:32:41.0384351Z scale_ub_tensor = None 2025-05-07T20:32:41.0384619Z 2025-05-07T20:32:41.0384861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0385178Z op = silu_mul_quant 2025-05-07T20:32:41.0385446Z if compiled: 2025-05-07T20:32:41.0385707Z op = torch.compile(op) 2025-05-07T20:32:41.0386010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0386304Z 2025-05-07T20:32:41.0386510Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0387049Z 2025-05-07T20:32:41.0387168Z moe/activation_test.py:117: 2025-05-07T20:32:41.0387496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0388020Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0388315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0389015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0389717Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0390270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0390966Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0391635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0392179Z kernel = self.compile( 2025-05-07T20:32:41.0392738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0393395Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0393808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0394046Z 2025-05-07T20:32:41.0394255Z self = 2025-05-07T20:32:41.0395348Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0396829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be976bd80>} 2025-05-07T20:32:41.0398233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0399266Z context = 2025-05-07T20:32:41.0399562Z 2025-05-07T20:32:41.0399737Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0400272Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0400746Z module_map=module_map) 2025-05-07T20:32:41.0401118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0401481Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0401746Z E ^ 2025-05-07T20:32:41.0402216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0402668Z 2025-05-07T20:32:41.0403093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0403604Z 2025-05-07T20:32:41.0403717Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0404133Z self=, 2025-05-07T20:32:41.0404548Z T=4096, 2025-05-07T20:32:41.0404747Z D=7168, 2025-05-07T20:32:41.0404944Z scale_ub=1200.0, 2025-05-07T20:32:41.0405181Z contiguous=False, 2025-05-07T20:32:41.0405419Z compiled=False, 2025-05-07T20:32:41.0405628Z ) 2025-05-07T20:32:41.0405960Z self = 2025-05-07T20:32:41.0406470Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.0406749Z 2025-05-07T20:32:41.0406839Z @given( 2025-05-07T20:32:41.0407075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0407402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0407813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0408150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0408493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0408873Z ) 2025-05-07T20:32:41.0409221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0409679Z def test_silu_mul_quant( 2025-05-07T20:32:41.0409935Z self, 2025-05-07T20:32:41.0410138Z T: int, 2025-05-07T20:32:41.0410541Z D: int, 2025-05-07T20:32:41.0410770Z scale_ub: Optional[float], 2025-05-07T20:32:41.0411042Z contiguous: bool, 2025-05-07T20:32:41.0411297Z compiled: bool, 2025-05-07T20:32:41.0411537Z ) -> None: 2025-05-07T20:32:41.0411757Z torch.manual_seed(2025) 2025-05-07T20:32:41.0412005Z 2025-05-07T20:32:41.0412286Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0412638Z 2025-05-07T20:32:41.0412842Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0413140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0413454Z x = x_sign * x_clamp 2025-05-07T20:32:41.0413697Z x0 = x[:, :D] 2025-05-07T20:32:41.0413939Z x1 = x[:, D:] 2025-05-07T20:32:41.0414152Z 2025-05-07T20:32:41.0414338Z if contiguous: 2025-05-07T20:32:41.0414582Z x0 = x0.contiguous() 2025-05-07T20:32:41.0414847Z x1 = x1.contiguous() 2025-05-07T20:32:41.0415088Z 2025-05-07T20:32:41.0415290Z if scale_ub is not None: 2025-05-07T20:32:41.0415570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0415907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0416224Z ) 2025-05-07T20:32:41.0416426Z else: 2025-05-07T20:32:41.0416641Z scale_ub_tensor = None 2025-05-07T20:32:41.0416902Z 2025-05-07T20:32:41.0417142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0417473Z op = silu_mul_quant 2025-05-07T20:32:41.0417732Z if compiled: 2025-05-07T20:32:41.0417988Z op = torch.compile(op) 2025-05-07T20:32:41.0418291Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0418578Z 2025-05-07T20:32:41.0418784Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0418949Z 2025-05-07T20:32:41.0419058Z moe/activation_test.py:117: 2025-05-07T20:32:41.0419354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0419702Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0419989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0420677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:41.0421375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0421924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0422614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0423279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0423821Z kernel = self.compile( 2025-05-07T20:32:41.0424367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0425022Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0425427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0425667Z 2025-05-07T20:32:41.0425880Z self = 2025-05-07T20:32:41.0427064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0428448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be93a0e00>} 2025-05-07T20:32:41.0429901Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0430933Z context = 2025-05-07T20:32:41.0431231Z 2025-05-07T20:32:41.0431405Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0431936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0432410Z module_map=module_map) 2025-05-07T20:32:41.0432782Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0433155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0433419Z E ^ 2025-05-07T20:32:41.0433893Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0434368Z 2025-05-07T20:32:41.0434788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0435299Z 2025-05-07T20:32:41.0435415Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0435879Z self=, 2025-05-07T20:32:41.0436291Z T=16384, 2025-05-07T20:32:41.0436500Z D=7168, 2025-05-07T20:32:41.0436698Z scale_ub=None, 2025-05-07T20:32:41.0436921Z contiguous=True, 2025-05-07T20:32:41.0437158Z compiled=True, 2025-05-07T20:32:41.0437371Z ) 2025-05-07T20:32:41.2242465Z self = 2025-05-07T20:32:41.2243268Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.2243650Z 2025-05-07T20:32:41.2243748Z @given( 2025-05-07T20:32:41.2243987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2244321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2244637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2244967Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2245302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2245596Z ) 2025-05-07T20:32:41.2245949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2246404Z def test_silu_mul_quant( 2025-05-07T20:32:41.2246656Z self, 2025-05-07T20:32:41.2246856Z T: int, 2025-05-07T20:32:41.2247064Z D: int, 2025-05-07T20:32:41.2247295Z scale_ub: Optional[float], 2025-05-07T20:32:41.2247567Z contiguous: bool, 2025-05-07T20:32:41.2247828Z compiled: bool, 2025-05-07T20:32:41.2248075Z ) -> None: 2025-05-07T20:32:41.2248297Z torch.manual_seed(2025) 2025-05-07T20:32:41.2248538Z 2025-05-07T20:32:41.2248821Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2249178Z 2025-05-07T20:32:41.2257333Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2257675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2257992Z x = x_sign * x_clamp 2025-05-07T20:32:41.2258248Z x0 = x[:, :D] 2025-05-07T20:32:41.2258477Z x1 = x[:, D:] 2025-05-07T20:32:41.2258686Z 2025-05-07T20:32:41.2258884Z if contiguous: 2025-05-07T20:32:41.2259132Z x0 = x0.contiguous() 2025-05-07T20:32:41.2259406Z x1 = x1.contiguous() 2025-05-07T20:32:41.2259649Z 2025-05-07T20:32:41.2259843Z if scale_ub is not None: 2025-05-07T20:32:41.2260131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2260799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2261109Z ) 2025-05-07T20:32:41.2261298Z else: 2025-05-07T20:32:41.2261520Z scale_ub_tensor = None 2025-05-07T20:32:41.2261930Z 2025-05-07T20:32:41.2262176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2262502Z op = silu_mul_quant 2025-05-07T20:32:41.2262752Z if compiled: 2025-05-07T20:32:41.2263011Z op = torch.compile(op) 2025-05-07T20:32:41.2263317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2263594Z 2025-05-07T20:32:41.2263797Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2263963Z 2025-05-07T20:32:41.2264078Z moe/activation_test.py:117: 2025-05-07T20:32:41.2264384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2264716Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2265007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2265874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2266436Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4d8fe0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4afb00>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[Ten further Hypothesis examples elided: each ran the identical test source and failed with the identical CompilationError traceback shown above. The parameter combinations tried were:]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
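Every example fails at the same point: Triton rejects the _fbgemm_silu_mul_quant kernel because the fp8e4nv dtype (Triton's name for torch.float8_e4m3fn) can only be compiled for NVIDIA GPUs with compute capability 8.9 or newer (e.g. L4, L40S, H100), while this device only offers 'fp8e4b15' and 'fp8e5', which is the pre-SM-8.9 behavior of an Ampere-class part such as the A10G (SM 8.6). A minimal sketch of a capability guard that would skip these examples on such hardware follows; the supports_fp8e4nv helper and the test-class decoration are illustrative assumptions, not FBGEMM's actual code.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv kernels need an NVIDIA GPU with compute capability >= (8, 9);
    # an A10G reports (8, 6) and raises the CompilationError seen above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "FP8 (fp8e4nv) kernels need SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
    ...

Guarding once at collection time would also stop Hypothesis from re-deriving the same CompilationError for every sampled (T, D, scale_ub, contiguous, compiled) combination, as it does in the remainder of this log.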
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0220216Z 2025-05-07T20:32:42.0220633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0221154Z 2025-05-07T20:32:42.1934963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1935597Z self=, 2025-05-07T20:32:42.1936018Z T=16384, 2025-05-07T20:32:42.1936214Z D=5120, 2025-05-07T20:32:42.1936425Z scale_ub=None, 2025-05-07T20:32:42.1936659Z contiguous=False, 2025-05-07T20:32:42.1936893Z compiled=True, 2025-05-07T20:32:42.1937100Z ) 2025-05-07T20:32:42.1937427Z self = 2025-05-07T20:32:42.1938119Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.1938396Z 2025-05-07T20:32:42.1938477Z @given( 2025-05-07T20:32:42.1938714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1939033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1939335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1939669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1940001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1940288Z ) 2025-05-07T20:32:42.1940932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1941381Z def test_silu_mul_quant( 2025-05-07T20:32:42.1941631Z self, 2025-05-07T20:32:42.1941966Z T: int, 2025-05-07T20:32:42.1942165Z D: int, 2025-05-07T20:32:42.1942383Z scale_ub: Optional[float], 2025-05-07T20:32:42.1942675Z contiguous: bool, 2025-05-07T20:32:42.1942912Z compiled: bool, 2025-05-07T20:32:42.1943140Z ) -> None: 2025-05-07T20:32:42.1943360Z torch.manual_seed(2025) 2025-05-07T20:32:42.1943597Z 2025-05-07T20:32:42.1943872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1944221Z 2025-05-07T20:32:42.1944414Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1944708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1945023Z x = x_sign * x_clamp 2025-05-07T20:32:42.1945261Z x0 = x[:, :D] 2025-05-07T20:32:42.1945492Z x1 = x[:, D:] 2025-05-07T20:32:42.1945704Z 2025-05-07T20:32:42.1945891Z if contiguous: 2025-05-07T20:32:42.1946128Z x0 = x0.contiguous() 2025-05-07T20:32:42.1946391Z x1 = x1.contiguous() 2025-05-07T20:32:42.1946632Z 2025-05-07T20:32:42.1946836Z if scale_ub is not None: 2025-05-07T20:32:42.1947113Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1947450Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1947761Z ) 2025-05-07T20:32:42.1947962Z else: 2025-05-07T20:32:42.1948217Z scale_ub_tensor = None 2025-05-07T20:32:42.1948479Z 2025-05-07T20:32:42.1948719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1949039Z op = silu_mul_quant 2025-05-07T20:32:42.1949286Z if compiled: 2025-05-07T20:32:42.1949538Z op = torch.compile(op) 2025-05-07T20:32:42.1949844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1950122Z 2025-05-07T20:32:42.1950319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1950483Z 2025-05-07T20:32:42.1950589Z moe/activation_test.py:117: 2025-05-07T20:32:42.1950888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1951227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1951513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1952077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.1952635Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.1953297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1953987Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1954525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1955217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1955990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1956531Z kernel = self.compile( 2025-05-07T20:32:42.1957069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1957727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1958163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1958413Z 2025-05-07T20:32:42.1958627Z self = 2025-05-07T20:32:42.1959795Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1961191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8ae1d00>} 2025-05-07T20:32:42.1962617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1963646Z context = 2025-05-07T20:32:42.1963934Z 2025-05-07T20:32:42.1964108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1964631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1965112Z module_map=module_map) 2025-05-07T20:32:42.1965737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1966109Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1966378Z E ^ 2025-05-07T20:32:42.1966843Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1967309Z 2025-05-07T20:32:42.1967727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1968237Z 2025-05-07T20:32:42.1968347Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1968760Z self=, 2025-05-07T20:32:42.1969165Z T=2048, 2025-05-07T20:32:42.1969365Z D=5120, 2025-05-07T20:32:42.1969570Z scale_ub=None, 2025-05-07T20:32:42.1969784Z contiguous=False, 2025-05-07T20:32:42.1970013Z compiled=True, 2025-05-07T20:32:42.1970221Z ) 2025-05-07T20:32:42.4919928Z self = 2025-05-07T20:32:42.4920768Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.4921054Z 2025-05-07T20:32:42.4921135Z @given( 2025-05-07T20:32:42.4921369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4921694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4921998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4922333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4922662Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4922946Z ) 2025-05-07T20:32:42.4923300Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4923744Z def test_silu_mul_quant( 2025-05-07T20:32:42.4923986Z self, 2025-05-07T20:32:42.4924183Z T: int, 2025-05-07T20:32:42.4924386Z D: int, 2025-05-07T20:32:42.4924609Z scale_ub: Optional[float], 2025-05-07T20:32:42.4924879Z contiguous: bool, 2025-05-07T20:32:42.4925126Z compiled: bool, 2025-05-07T20:32:42.4925364Z ) -> None: 2025-05-07T20:32:42.4925581Z torch.manual_seed(2025) 2025-05-07T20:32:42.4925834Z 2025-05-07T20:32:42.4926115Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4926467Z 2025-05-07T20:32:42.4926673Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4926972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4927281Z x = x_sign * x_clamp 2025-05-07T20:32:42.4927530Z x0 = x[:, :D] 2025-05-07T20:32:42.4927753Z x1 = x[:, D:] 2025-05-07T20:32:42.4927965Z 2025-05-07T20:32:42.4928168Z if contiguous: 2025-05-07T20:32:42.4928405Z x0 = x0.contiguous() 2025-05-07T20:32:42.4928665Z x1 = x1.contiguous() 2025-05-07T20:32:42.4928914Z 2025-05-07T20:32:42.4929125Z if scale_ub is not None: 2025-05-07T20:32:42.4929401Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4930095Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4930413Z ) 2025-05-07T20:32:42.4930622Z else: 2025-05-07T20:32:42.4930834Z scale_ub_tensor = None 2025-05-07T20:32:42.4931099Z 2025-05-07T20:32:42.4931491Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4931803Z op = silu_mul_quant 2025-05-07T20:32:42.4932059Z if compiled: 2025-05-07T20:32:42.4932310Z op = torch.compile(op) 2025-05-07T20:32:42.4932609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4932888Z 2025-05-07T20:32:42.4933094Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.4933258Z 2025-05-07T20:32:42.4933363Z moe/activation_test.py:117: 2025-05-07T20:32:42.4933666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4934008Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.4934294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4934857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.4935419Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.4936077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.4936767Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.4937311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4938010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4938678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4939211Z kernel = self.compile( 2025-05-07T20:32:42.4939753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4940422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4940825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4941066Z 2025-05-07T20:32:42.4941282Z self = 2025-05-07T20:32:42.4942378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4943769Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8ae14e0>} 2025-05-07T20:32:42.4953475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4954538Z context = 2025-05-07T20:32:42.4954835Z 2025-05-07T20:32:42.4955012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4955546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4956100Z module_map=module_map) 2025-05-07T20:32:42.4956469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4956835Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.4957101Z E ^ 2025-05-07T20:32:42.4957568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4958024Z 2025-05-07T20:32:42.4958445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.4958964Z 2025-05-07T20:32:42.4959203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.4959624Z self=, 2025-05-07T20:32:42.4960039Z T=2048, 2025-05-07T20:32:42.4960228Z D=5120, 2025-05-07T20:32:42.4960512Z scale_ub=1200.0, 2025-05-07T20:32:42.4960743Z contiguous=False, 2025-05-07T20:32:42.4960965Z compiled=True, 2025-05-07T20:32:42.4961187Z ) 2025-05-07T20:32:42.4961518Z self = 2025-05-07T20:32:42.4962014Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.4962309Z 2025-05-07T20:32:42.4962392Z @given( 2025-05-07T20:32:42.4962632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4962943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4963260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4963602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4963948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4964231Z ) 2025-05-07T20:32:42.4964583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4965035Z def test_silu_mul_quant( 2025-05-07T20:32:42.4965289Z self, 2025-05-07T20:32:42.4965772Z T: int, 2025-05-07T20:32:42.4965979Z D: int, 2025-05-07T20:32:42.4966195Z scale_ub: Optional[float], 2025-05-07T20:32:42.4966480Z contiguous: bool, 2025-05-07T20:32:42.4966728Z compiled: bool, 2025-05-07T20:32:42.4966950Z ) -> None: 2025-05-07T20:32:42.4967173Z torch.manual_seed(2025) 2025-05-07T20:32:42.4967419Z 2025-05-07T20:32:42.4967689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4968034Z 2025-05-07T20:32:42.4968238Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4968528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4968843Z x = x_sign * x_clamp 2025-05-07T20:32:42.4969096Z x0 = x[:, :D] 2025-05-07T20:32:42.4969316Z x1 = x[:, D:] 2025-05-07T20:32:42.4969522Z 2025-05-07T20:32:42.4969715Z if contiguous: 2025-05-07T20:32:42.4969951Z x0 = x0.contiguous() 2025-05-07T20:32:42.4970209Z x1 = x1.contiguous() 2025-05-07T20:32:42.4970443Z 2025-05-07T20:32:42.4970636Z if scale_ub is not None: 2025-05-07T20:32:42.4970908Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4971241Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4971562Z ) 2025-05-07T20:32:42.4971755Z else: 2025-05-07T20:32:42.4971975Z scale_ub_tensor = None 2025-05-07T20:32:42.4972231Z 2025-05-07T20:32:42.4972465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4972786Z op = silu_mul_quant 2025-05-07T20:32:42.4973042Z if compiled: 2025-05-07T20:32:42.4973292Z op = torch.compile(op) 2025-05-07T20:32:42.4973600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4973880Z 2025-05-07T20:32:42.4974081Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.4974248Z 2025-05-07T20:32:42.4974356Z moe/activation_test.py:117: 2025-05-07T20:32:42.4974661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4975009Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.4975289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4975855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.4976425Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.4977097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.4977778Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.4978478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4979173Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4979830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4980475Z kernel = self.compile( 2025-05-07T20:32:42.4981021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4981679Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4982077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4982313Z 2025-05-07T20:32:42.4982524Z self = 2025-05-07T20:32:42.4983617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4985001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be97fdf80>} 2025-05-07T20:32:42.4986342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4987378Z context = 2025-05-07T20:32:42.4987678Z 2025-05-07T20:32:42.4987846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4988375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4988842Z module_map=module_map) 2025-05-07T20:32:42.4989219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4989583Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.4989850Z E ^ 2025-05-07T20:32:42.4990312Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4990777Z 2025-05-07T20:32:42.4991191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.4991703Z 2025-05-07T20:32:42.6730918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6731510Z self=, 2025-05-07T20:32:42.6731937Z T=4096, 2025-05-07T20:32:42.6732139Z D=5120, 2025-05-07T20:32:42.6732338Z scale_ub=1200.0, 2025-05-07T20:32:42.6732576Z contiguous=True, 2025-05-07T20:32:42.6732814Z compiled=True, 2025-05-07T20:32:42.6733028Z ) 2025-05-07T20:32:42.6733384Z self = 2025-05-07T20:32:42.6733893Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.6734172Z 2025-05-07T20:32:42.6734263Z @given( 2025-05-07T20:32:42.6734518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6734849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6735180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6735524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6735869Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6736172Z ) 2025-05-07T20:32:42.6736528Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6736987Z def test_silu_mul_quant( 2025-05-07T20:32:42.6737245Z self, 2025-05-07T20:32:42.6737459Z T: int, 2025-05-07T20:32:42.6737666Z D: int, 2025-05-07T20:32:42.6737897Z scale_ub: Optional[float], 2025-05-07T20:32:42.6738462Z contiguous: bool, 2025-05-07T20:32:42.6738722Z compiled: bool, 2025-05-07T20:32:42.6738962Z ) -> None: 2025-05-07T20:32:42.6739185Z torch.manual_seed(2025) 2025-05-07T20:32:42.6739449Z 2025-05-07T20:32:42.6739868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6740220Z 2025-05-07T20:32:42.6740431Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6740743Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6741061Z x = x_sign * x_clamp 2025-05-07T20:32:42.6741320Z x0 = x[:, :D] 2025-05-07T20:32:42.6741549Z x1 = x[:, D:] 2025-05-07T20:32:42.6741759Z 2025-05-07T20:32:42.6741959Z if contiguous: 2025-05-07T20:32:42.6742204Z x0 = x0.contiguous() 2025-05-07T20:32:42.6742466Z x1 = x1.contiguous() 2025-05-07T20:32:42.6742717Z 2025-05-07T20:32:42.6742921Z if scale_ub is not None: 2025-05-07T20:32:42.6743201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6743550Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6743871Z ) 2025-05-07T20:32:42.6744076Z else: 2025-05-07T20:32:42.6744290Z scale_ub_tensor = None 2025-05-07T20:32:42.6744560Z 2025-05-07T20:32:42.6744805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6745122Z op = silu_mul_quant 2025-05-07T20:32:42.6745384Z if compiled: 2025-05-07T20:32:42.6745639Z op = torch.compile(op) 2025-05-07T20:32:42.6745937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6746222Z 2025-05-07T20:32:42.6746427Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6746593Z 2025-05-07T20:32:42.6746697Z moe/activation_test.py:117: 2025-05-07T20:32:42.6747004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6747348Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6747639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6748201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6748774Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6749445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6750133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6750676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6751369Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6752041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6752577Z kernel = self.compile( 2025-05-07T20:32:42.6753134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6753798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6754201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6754444Z 2025-05-07T20:32:42.6754654Z self = 2025-05-07T20:32:42.6755843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6757247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be97fe200>} 2025-05-07T20:32:42.6758694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6759725Z context = 2025-05-07T20:32:42.6760023Z 2025-05-07T20:32:42.6760328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6760861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6761348Z module_map=module_map) 2025-05-07T20:32:42.6761712Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6762080Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6762356Z E ^ 2025-05-07T20:32:42.6762825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6763287Z 2025-05-07T20:32:42.6763704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6764233Z 2025-05-07T20:32:42.6764340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.6764761Z self=, 2025-05-07T20:32:42.6765177Z T=128, 2025-05-07T20:32:42.6765665Z D=5120, 2025-05-07T20:32:42.6765873Z scale_ub=1200.0, 2025-05-07T20:32:42.6766098Z contiguous=False, 2025-05-07T20:32:42.6766328Z compiled=True, 2025-05-07T20:32:42.6766539Z ) 2025-05-07T20:32:42.7782726Z self = 2025-05-07T20:32:42.7783530Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.7783874Z 2025-05-07T20:32:42.7783959Z @given( 2025-05-07T20:32:42.7784196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7784524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7784838Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7785201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7785535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7785826Z ) 2025-05-07T20:32:42.7786176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7786639Z def test_silu_mul_quant( 2025-05-07T20:32:42.7786886Z self, 2025-05-07T20:32:42.7787084Z T: int, 2025-05-07T20:32:42.7787293Z D: int, 2025-05-07T20:32:42.7787522Z scale_ub: Optional[float], 2025-05-07T20:32:42.7787795Z contiguous: bool, 2025-05-07T20:32:42.7788052Z compiled: bool, 2025-05-07T20:32:42.7788287Z ) -> None: 2025-05-07T20:32:42.7788506Z torch.manual_seed(2025) 2025-05-07T20:32:42.7788760Z 2025-05-07T20:32:42.7789046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7789395Z 2025-05-07T20:32:42.7789606Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7789917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7790240Z x = x_sign * x_clamp 2025-05-07T20:32:42.7790486Z x0 = x[:, :D] 2025-05-07T20:32:42.7790712Z x1 = x[:, D:] 2025-05-07T20:32:42.7790931Z 2025-05-07T20:32:42.7791128Z if contiguous: 2025-05-07T20:32:42.7791375Z x0 = x0.contiguous() 2025-05-07T20:32:42.7791649Z x1 = x1.contiguous() 2025-05-07T20:32:42.7791894Z 2025-05-07T20:32:42.7792095Z if scale_ub is not None: 2025-05-07T20:32:42.7792376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7792712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7793041Z ) 2025-05-07T20:32:42.7793249Z else: 2025-05-07T20:32:42.7793463Z scale_ub_tensor = None 2025-05-07T20:32:42.7793732Z 2025-05-07T20:32:42.7793979Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7794299Z op = silu_mul_quant 2025-05-07T20:32:42.7794560Z if compiled: 2025-05-07T20:32:42.7795133Z op = torch.compile(op) 2025-05-07T20:32:42.7795442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7795818Z 2025-05-07T20:32:42.7796022Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7796331Z 2025-05-07T20:32:42.7796441Z moe/activation_test.py:117: 2025-05-07T20:32:42.7796742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7797088Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7797382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7797941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7798560Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7799226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7799920Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7800463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7801160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7801841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7802373Z kernel = self.compile( 2025-05-07T20:32:42.7802922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7803587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7803998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7804233Z 2025-05-07T20:32:42.7804444Z self = 2025-05-07T20:32:42.7805538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7806939Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea294c20>} 2025-05-07T20:32:42.7808293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7809331Z context = 2025-05-07T20:32:42.7809621Z 2025-05-07T20:32:42.7809791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7810322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7810810Z module_map=module_map) 2025-05-07T20:32:42.7811177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7811550Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7811821Z E ^ 2025-05-07T20:32:42.7812301Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7812754Z 2025-05-07T20:32:42.7813172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7813692Z 2025-05-07T20:32:42.7813800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.7814224Z self=, 2025-05-07T20:32:42.7814640Z T=16384, 2025-05-07T20:32:42.7814841Z D=7168, 2025-05-07T20:32:42.7815048Z scale_ub=1200.0, 2025-05-07T20:32:42.7815284Z contiguous=True, 2025-05-07T20:32:42.7815510Z compiled=True, 2025-05-07T20:32:42.7815730Z ) 2025-05-07T20:32:42.7816200Z self = 2025-05-07T20:32:42.7816700Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.7817061Z 2025-05-07T20:32:42.7817143Z @given( 2025-05-07T20:32:42.7817388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.7817711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.7818034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.7818375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.7818714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.7819007Z ) 2025-05-07T20:32:42.7819368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.7819819Z def test_silu_mul_quant( 2025-05-07T20:32:42.7820066Z self, 2025-05-07T20:32:42.7820269Z T: int, 2025-05-07T20:32:42.7820478Z D: int, 2025-05-07T20:32:42.7820708Z scale_ub: Optional[float], 2025-05-07T20:32:42.7820989Z contiguous: bool, 2025-05-07T20:32:42.7821236Z compiled: bool, 2025-05-07T20:32:42.7821460Z ) -> None: 2025-05-07T20:32:42.7821693Z torch.manual_seed(2025) 2025-05-07T20:32:42.7821948Z 2025-05-07T20:32:42.7822226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.7822575Z 2025-05-07T20:32:42.7822775Z x_sign = torch.sign(x) 2025-05-07T20:32:42.7823076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.7823403Z x = x_sign * x_clamp 2025-05-07T20:32:42.7823658Z x0 = x[:, :D] 2025-05-07T20:32:42.7823881Z x1 = x[:, D:] 2025-05-07T20:32:42.7824102Z 2025-05-07T20:32:42.7824299Z if contiguous: 2025-05-07T20:32:42.7824537Z x0 = x0.contiguous() 2025-05-07T20:32:42.7824806Z x1 = x1.contiguous() 2025-05-07T20:32:42.7825060Z 2025-05-07T20:32:42.7825257Z if scale_ub is not None: 2025-05-07T20:32:42.7825545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.7825892Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.7826210Z ) 2025-05-07T20:32:42.7826420Z else: 2025-05-07T20:32:42.7826645Z scale_ub_tensor = None 2025-05-07T20:32:42.7826908Z 2025-05-07T20:32:42.7827149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.7827478Z op = silu_mul_quant 2025-05-07T20:32:42.7827739Z if compiled: 2025-05-07T20:32:42.7827992Z op = torch.compile(op) 2025-05-07T20:32:42.7828305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7828594Z 2025-05-07T20:32:42.7828792Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.7828964Z 2025-05-07T20:32:42.7829068Z moe/activation_test.py:117: 2025-05-07T20:32:42.7829370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7829719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.7830009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.7830580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.7831163Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.7831827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.7832536Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.7833085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.7833776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.7834450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.7835001Z kernel = self.compile( 2025-05-07T20:32:42.7835645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.7836391Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.7836879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.7837121Z 2025-05-07T20:32:42.7837335Z self = 2025-05-07T20:32:42.7838425Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.7839797Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea296a20>} 2025-05-07T20:32:42.7841159Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.7842199Z context = 2025-05-07T20:32:42.7842495Z 2025-05-07T20:32:42.7842675Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.7843215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.7843690Z module_map=module_map) 2025-05-07T20:32:42.7844071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.7844442Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.7844708Z E ^ 2025-05-07T20:32:42.7845191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7845646Z 2025-05-07T20:32:42.7846079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.7846596Z 2025-05-07T20:32:42.9080801Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9081447Z self=, 2025-05-07T20:32:42.9082065Z T=16384, 2025-05-07T20:32:42.9082275Z D=5120, 2025-05-07T20:32:42.9082475Z scale_ub=1200.0, 2025-05-07T20:32:42.9082697Z contiguous=True, 2025-05-07T20:32:42.9082926Z compiled=False, 2025-05-07T20:32:42.9083135Z ) 2025-05-07T20:32:42.9083455Z self = 2025-05-07T20:32:42.9083966Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.9084250Z 2025-05-07T20:32:42.9084345Z @given( 2025-05-07T20:32:42.9084606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9084929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9085250Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9085587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9085922Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9086211Z ) 2025-05-07T20:32:42.9086566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9087036Z def test_silu_mul_quant( 2025-05-07T20:32:42.9087277Z self, 2025-05-07T20:32:42.9087486Z T: int, 2025-05-07T20:32:42.9087694Z D: int, 2025-05-07T20:32:42.9087919Z scale_ub: Optional[float], 2025-05-07T20:32:42.9088206Z contiguous: bool, 2025-05-07T20:32:42.9088452Z compiled: bool, 2025-05-07T20:32:42.9088683Z ) -> None: 2025-05-07T20:32:42.9097039Z torch.manual_seed(2025) 2025-05-07T20:32:42.9097312Z 2025-05-07T20:32:42.9097594Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9097955Z 2025-05-07T20:32:42.9098162Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9098794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9099126Z x = x_sign * x_clamp 2025-05-07T20:32:42.9099380Z x0 = x[:, :D] 2025-05-07T20:32:42.9099767Z x1 = x[:, D:] 2025-05-07T20:32:42.9099994Z 2025-05-07T20:32:42.9100182Z if contiguous: 2025-05-07T20:32:42.9100436Z x0 = x0.contiguous() 2025-05-07T20:32:42.9100704Z x1 = x1.contiguous() 2025-05-07T20:32:42.9100945Z 2025-05-07T20:32:42.9101147Z if scale_ub is not None: 2025-05-07T20:32:42.9101436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.9101785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.9102103Z ) 2025-05-07T20:32:42.9102308Z else: 2025-05-07T20:32:42.9102531Z scale_ub_tensor = None 2025-05-07T20:32:42.9102791Z 2025-05-07T20:32:42.9103042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.9103374Z op = silu_mul_quant 2025-05-07T20:32:42.9103627Z if compiled: 2025-05-07T20:32:42.9103886Z op = torch.compile(op) 2025-05-07T20:32:42.9104192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9104476Z 2025-05-07T20:32:42.9104682Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.9104849Z 2025-05-07T20:32:42.9104964Z moe/activation_test.py:117: 2025-05-07T20:32:42.9105262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9105609Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.9105902Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9106606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.9107293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.9107834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9108528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9109198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9109735Z kernel = self.compile( 2025-05-07T20:32:42.9110281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9110944Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9111343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9111586Z 2025-05-07T20:32:42.9111796Z self = 2025-05-07T20:32:42.9112893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9114286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea19f9c0>} 2025-05-07T20:32:42.9115645Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9116771Z context = 2025-05-07T20:32:42.9117069Z 2025-05-07T20:32:42.9117239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9117765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9118244Z module_map=module_map) 2025-05-07T20:32:42.9118607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9119059Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.9119328Z E ^ 2025-05-07T20:32:42.9119795Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9120328Z 2025-05-07T20:32:42.9120746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9121267Z 2025-05-07T20:32:42.9121375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.9121804Z self=, 2025-05-07T20:32:42.9122220Z T=1, 2025-05-07T20:32:42.9122421Z D=7168, 2025-05-07T20:32:42.9122626Z scale_ub=1200.0, 2025-05-07T20:32:42.9122861Z contiguous=False, 2025-05-07T20:32:42.9123099Z compiled=False, 2025-05-07T20:32:42.9123319Z ) 2025-05-07T20:32:42.9123647Z self = 2025-05-07T20:32:42.9124163Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.9124441Z 2025-05-07T20:32:42.9124536Z @given( 2025-05-07T20:32:42.9124769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.9125103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.9125425Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.9125769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.9126105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.9126414Z ) 2025-05-07T20:32:42.9126775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.9127230Z def test_silu_mul_quant( 2025-05-07T20:32:42.9127485Z self, 2025-05-07T20:32:42.9127690Z T: int, 2025-05-07T20:32:42.9127891Z D: int, 2025-05-07T20:32:42.9128128Z scale_ub: Optional[float], 2025-05-07T20:32:42.9128416Z contiguous: bool, 2025-05-07T20:32:42.9128672Z compiled: bool, 2025-05-07T20:32:42.9128907Z ) -> None: 2025-05-07T20:32:42.9129143Z torch.manual_seed(2025) 2025-05-07T20:32:42.9129391Z 2025-05-07T20:32:42.9129673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.9130042Z 2025-05-07T20:32:42.9130250Z x_sign = torch.sign(x) 2025-05-07T20:32:42.9130549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.9130869Z x = x_sign * x_clamp 2025-05-07T20:32:42.9131125Z x0 = x[:, :D] 2025-05-07T20:32:42.9131352Z x1 = x[:, D:] 2025-05-07T20:32:42.9131570Z 2025-05-07T20:32:42.9131767Z if contiguous: 2025-05-07T20:32:42.9132002Z x0 = x0.contiguous() 2025-05-07T20:32:42.9132270Z x1 = x1.contiguous() 2025-05-07T20:32:42.9132513Z 2025-05-07T20:32:42.9132703Z if scale_ub is not None: 2025-05-07T20:32:42.9132989Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.9133340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.9133653Z ) 2025-05-07T20:32:42.9133853Z else: 2025-05-07T20:32:42.9134074Z scale_ub_tensor = None 2025-05-07T20:32:42.9134331Z 2025-05-07T20:32:42.9134572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.9134893Z op = silu_mul_quant 2025-05-07T20:32:42.9135143Z if compiled: 2025-05-07T20:32:42.9135392Z op = torch.compile(op) 2025-05-07T20:32:42.9135704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9135992Z 2025-05-07T20:32:42.9136191Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.9136364Z 2025-05-07T20:32:42.9136471Z moe/activation_test.py:117: 2025-05-07T20:32:42.9136780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9137124Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.9137422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.9138204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.9138954Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.9139571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9140265Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9140946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9141481Z kernel = self.compile( 2025-05-07T20:32:42.9142036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9142701Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9143114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9143353Z 2025-05-07T20:32:42.9143568Z self = 2025-05-07T20:32:42.9144655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9146044Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c5800>} 2025-05-07T20:32:42.9147396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9148425Z context = 2025-05-07T20:32:42.9148774Z 2025-05-07T20:32:42.9148949Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9149491Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9149963Z module_map=module_map) 2025-05-07T20:32:42.9150346Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9150710Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.9150972Z E ^ 2025-05-07T20:32:42.9151441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9151907Z 2025-05-07T20:32:42.9152327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9152837Z 2025-05-07T20:32:43.0903504Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.0904125Z self=, 2025-05-07T20:32:43.0904725Z T=4096, 2025-05-07T20:32:43.0905009Z D=7168, 2025-05-07T20:32:43.0905285Z scale_ub=1200.0, 2025-05-07T20:32:43.0905521Z contiguous=False, 2025-05-07T20:32:43.0905754Z compiled=True, 2025-05-07T20:32:43.0905975Z ) 2025-05-07T20:32:43.0906313Z self = 2025-05-07T20:32:43.0906809Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.0907091Z 2025-05-07T20:32:43.0907171Z @given( 2025-05-07T20:32:43.0907404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.0907725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.0908031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.0908363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.0908695Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.0908979Z ) 2025-05-07T20:32:43.0909333Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.0910127Z def test_silu_mul_quant( 2025-05-07T20:32:43.0910379Z self, 2025-05-07T20:32:43.0910581Z T: int, 2025-05-07T20:32:43.0910785Z D: int, 2025-05-07T20:32:43.0911150Z scale_ub: Optional[float], 2025-05-07T20:32:43.0911431Z contiguous: bool, 2025-05-07T20:32:43.0911674Z compiled: bool, 2025-05-07T20:32:43.0911899Z ) -> None: 2025-05-07T20:32:43.0912119Z torch.manual_seed(2025) 2025-05-07T20:32:43.0912371Z 2025-05-07T20:32:43.0912652Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.0913014Z 2025-05-07T20:32:43.0913218Z x_sign = torch.sign(x) 2025-05-07T20:32:43.0913518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.0913828Z x = x_sign * x_clamp 2025-05-07T20:32:43.0914085Z x0 = x[:, :D] 2025-05-07T20:32:43.0914317Z x1 = x[:, D:] 2025-05-07T20:32:43.0914523Z 2025-05-07T20:32:43.0914718Z if contiguous: 2025-05-07T20:32:43.0914960Z x0 = x0.contiguous() 2025-05-07T20:32:43.0915223Z x1 = x1.contiguous() 2025-05-07T20:32:43.0915473Z 2025-05-07T20:32:43.0915679Z if scale_ub is not None: 2025-05-07T20:32:43.0916070Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.0916410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.0916728Z ) 2025-05-07T20:32:43.0916921Z else: 2025-05-07T20:32:43.0917137Z scale_ub_tensor = None 2025-05-07T20:32:43.0917394Z 2025-05-07T20:32:43.0917624Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.0917944Z op = silu_mul_quant 2025-05-07T20:32:43.0918200Z if compiled: 2025-05-07T20:32:43.0918449Z op = torch.compile(op) 2025-05-07T20:32:43.0918742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0919020Z 2025-05-07T20:32:43.0919218Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.0919386Z 2025-05-07T20:32:43.0919488Z moe/activation_test.py:117: 2025-05-07T20:32:43.0919790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.0920129Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.0920417Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.0920982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.0921547Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.0922207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.0922891Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.0923429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.0924112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.0924775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.0925309Z kernel = self.compile( 2025-05-07T20:32:43.0925850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.0926508Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.0926903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.0927141Z 2025-05-07T20:32:43.0927351Z self = 2025-05-07T20:32:43.0928437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.0929921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c6d40>} 2025-05-07T20:32:43.0931265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.0932377Z context = 2025-05-07T20:32:43.0932671Z 2025-05-07T20:32:43.0932838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.0933362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.0933833Z module_map=module_map) 2025-05-07T20:32:43.0934203Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.0934568Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.0934837Z E ^ 2025-05-07T20:32:43.0935303Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.0935765Z 2025-05-07T20:32:43.0936179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.0936694Z 2025-05-07T20:32:43.0936808Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.0937226Z self=, 2025-05-07T20:32:43.0937628Z T=128, 2025-05-07T20:32:43.0937825Z D=7168, 2025-05-07T20:32:43.0938027Z scale_ub=1200.0, 2025-05-07T20:32:43.0938253Z contiguous=False, 2025-05-07T20:32:43.0938492Z compiled=True, 2025-05-07T20:32:43.0938706Z ) 2025-05-07T20:32:43.1861402Z self = 2025-05-07T20:32:43.1862921Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.1863663Z 2025-05-07T20:32:43.1863827Z @given( 2025-05-07T20:32:43.1864329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1864961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1866023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1866703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1867364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1867927Z ) 2025-05-07T20:32:43.1868563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1869062Z def test_silu_mul_quant( 2025-05-07T20:32:43.1869308Z self, 2025-05-07T20:32:43.1869501Z T: int, 2025-05-07T20:32:43.1869705Z D: int, 2025-05-07T20:32:43.1869929Z scale_ub: Optional[float], 2025-05-07T20:32:43.1870199Z contiguous: bool, 2025-05-07T20:32:43.1870448Z compiled: bool, 2025-05-07T20:32:43.1870678Z ) -> None: 2025-05-07T20:32:43.1870896Z torch.manual_seed(2025) 2025-05-07T20:32:43.1871153Z 2025-05-07T20:32:43.1871431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1871775Z 2025-05-07T20:32:43.1871977Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1872279Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1872589Z x = x_sign * x_clamp 2025-05-07T20:32:43.1872836Z x0 = x[:, :D] 2025-05-07T20:32:43.1873058Z x1 = x[:, D:] 2025-05-07T20:32:43.1873264Z 2025-05-07T20:32:43.1873458Z if contiguous: 2025-05-07T20:32:43.1873695Z x0 = x0.contiguous() 2025-05-07T20:32:43.1873958Z x1 = x1.contiguous() 2025-05-07T20:32:43.1874198Z 2025-05-07T20:32:43.1874397Z if scale_ub is not None: 2025-05-07T20:32:43.1874683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1875019Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1875339Z ) 2025-05-07T20:32:43.1875544Z else: 2025-05-07T20:32:43.1876133Z scale_ub_tensor = None 2025-05-07T20:32:43.1876399Z 2025-05-07T20:32:43.1876646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1876964Z op = silu_mul_quant 2025-05-07T20:32:43.1877364Z if compiled: 2025-05-07T20:32:43.1877621Z op = torch.compile(op) 2025-05-07T20:32:43.1877922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1878208Z 2025-05-07T20:32:43.1878415Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1878581Z 2025-05-07T20:32:43.1878683Z moe/activation_test.py:117: 2025-05-07T20:32:43.1878990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1879333Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1879625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1880189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.1880768Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.1881433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1882126Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1882666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1883349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1884018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1884549Z kernel = self.compile( 2025-05-07T20:32:43.1885092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1885749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1886158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1886390Z 2025-05-07T20:32:43.1886598Z self = 2025-05-07T20:32:43.1887683Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1889097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04ae2160>} 2025-05-07T20:32:43.1890449Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1891474Z context = 2025-05-07T20:32:43.1891771Z 2025-05-07T20:32:43.1891944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1892475Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1892951Z module_map=module_map) 2025-05-07T20:32:43.1893326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1893691Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1893954Z E ^ 2025-05-07T20:32:43.1894417Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1894876Z 2025-05-07T20:32:43.1895293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1895807Z 2025-05-07T20:32:43.1895923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1896346Z self=, 2025-05-07T20:32:43.1896840Z T=2048, 2025-05-07T20:32:43.1897040Z D=7168, 2025-05-07T20:32:43.1897243Z scale_ub=None, 2025-05-07T20:32:43.1897461Z contiguous=True, 2025-05-07T20:32:43.1897799Z compiled=True, 2025-05-07T20:32:43.1898010Z ) 2025-05-07T20:32:43.1898329Z self = 2025-05-07T20:32:43.1898827Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1899097Z 2025-05-07T20:32:43.1899186Z @given( 2025-05-07T20:32:43.1899419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1899740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1900054Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1900400Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1900733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1901023Z ) 2025-05-07T20:32:43.1901388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1901829Z def test_silu_mul_quant( 2025-05-07T20:32:43.1902081Z self, 2025-05-07T20:32:43.1902283Z T: int, 2025-05-07T20:32:43.1902490Z D: int, 2025-05-07T20:32:43.1902717Z scale_ub: Optional[float], 2025-05-07T20:32:43.1902999Z contiguous: bool, 2025-05-07T20:32:43.1903247Z compiled: bool, 2025-05-07T20:32:43.1903482Z ) -> None: 2025-05-07T20:32:43.1903708Z torch.manual_seed(2025) 2025-05-07T20:32:43.1903955Z 2025-05-07T20:32:43.1904233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1904595Z 2025-05-07T20:32:43.1904789Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1905092Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1905425Z x = x_sign * x_clamp 2025-05-07T20:32:43.1905677Z x0 = x[:, :D] 2025-05-07T20:32:43.1905902Z x1 = x[:, D:] 2025-05-07T20:32:43.1906126Z 2025-05-07T20:32:43.1906318Z if contiguous: 2025-05-07T20:32:43.1906556Z x0 = x0.contiguous() 2025-05-07T20:32:43.1906828Z x1 = x1.contiguous() 2025-05-07T20:32:43.1907086Z 2025-05-07T20:32:43.1907281Z if scale_ub is not None: 2025-05-07T20:32:43.1907564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1907916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1908229Z ) 2025-05-07T20:32:43.1908439Z else: 2025-05-07T20:32:43.1908680Z scale_ub_tensor = None 2025-05-07T20:32:43.1908958Z 2025-05-07T20:32:43.1909200Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1909530Z op = silu_mul_quant 2025-05-07T20:32:43.1909781Z if compiled: 2025-05-07T20:32:43.1910045Z op = torch.compile(op) 2025-05-07T20:32:43.1910355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1910635Z 2025-05-07T20:32:43.1910837Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1911013Z 2025-05-07T20:32:43.1911115Z moe/activation_test.py:117: 2025-05-07T20:32:43.1911416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1911754Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1912047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1912610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.1913170Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.1913852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1914553Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1915108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1915954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1916630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1917171Z kernel = self.compile( 2025-05-07T20:32:43.1917784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1926378Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1926822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1927062Z 2025-05-07T20:32:43.1927280Z self = 2025-05-07T20:32:43.1928373Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1929768Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8bdd18a0>} 2025-05-07T20:32:43.1931128Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1932171Z context = 2025-05-07T20:32:43.1932463Z 2025-05-07T20:32:43.1932644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1933168Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1933646Z module_map=module_map) 2025-05-07T20:32:43.1934014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1934379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1934647Z E ^ 2025-05-07T20:32:43.1935122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1935589Z 2025-05-07T20:32:43.1936010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1936522Z 2025-05-07T20:32:43.2538381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2539588Z self=, 2025-05-07T20:32:43.2540802Z T=16384, 2025-05-07T20:32:43.2541351Z D=5120, 2025-05-07T20:32:43.2541891Z scale_ub=None, 2025-05-07T20:32:43.2542476Z contiguous=False, 2025-05-07T20:32:43.2543042Z compiled=False, 2025-05-07T20:32:43.2543465Z ) 2025-05-07T20:32:43.2544102Z self = 2025-05-07T20:32:43.2545112Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2545710Z 2025-05-07T20:32:43.2545872Z @given( 2025-05-07T20:32:43.2546339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2546965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2547601Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2548260Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2548762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2549091Z ) 2025-05-07T20:32:43.2549446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2549895Z def test_silu_mul_quant( 2025-05-07T20:32:43.2550155Z self, 2025-05-07T20:32:43.2550363Z T: int, 2025-05-07T20:32:43.2550562Z D: int, 2025-05-07T20:32:43.2550792Z scale_ub: Optional[float], 2025-05-07T20:32:43.2551073Z contiguous: bool, 2025-05-07T20:32:43.2551322Z compiled: bool, 2025-05-07T20:32:43.2551551Z ) -> None: 2025-05-07T20:32:43.2552085Z torch.manual_seed(2025) 2025-05-07T20:32:43.2552345Z 2025-05-07T20:32:43.2552620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2553102Z 2025-05-07T20:32:43.2553307Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2553600Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2555644Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
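The OOM report above already names the most direct mitigation: the caching-allocator hint PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it, assuming the test process is launched from Python; the variable is read when the CUDA caching allocator initializes, so it must be set before any CUDA work:

    import os

    # Must happen before the first CUDA allocation in this process; setting it
    # before importing torch is the safest ordering.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # the allocator reads the variable lazily, on first CUDA use

Whether this rescues the job depends on the failure mode: expandable segments help when free memory is fragmented, not when the ~22 GiB device is genuinely exhausted.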
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2557649Z 2025-05-07T20:32:43.2557776Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2557997Z 2025-05-07T20:32:43.2558107Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2558528Z self=, 2025-05-07T20:32:43.2558948Z T=4096, 2025-05-07T20:32:43.2559152Z D=7168, 2025-05-07T20:32:43.2559361Z scale_ub=1200.0, 2025-05-07T20:32:43.2559587Z contiguous=True, 2025-05-07T20:32:43.2559822Z compiled=True, 2025-05-07T20:32:43.2560045Z ) 2025-05-07T20:32:43.2560369Z self = 2025-05-07T20:32:43.2560878Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2561150Z 2025-05-07T20:32:43.2561239Z @given( 2025-05-07T20:32:43.2561471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2561797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2562116Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2562459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2562788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2563084Z ) 2025-05-07T20:32:43.2563447Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2563893Z def test_silu_mul_quant( 2025-05-07T20:32:43.2564146Z self, 2025-05-07T20:32:43.2564354Z T: int, 2025-05-07T20:32:43.2564558Z D: int, 2025-05-07T20:32:43.2564787Z scale_ub: Optional[float], 2025-05-07T20:32:43.2565065Z contiguous: bool, 2025-05-07T20:32:43.2565310Z compiled: bool, 2025-05-07T20:32:43.2565815Z ) -> None: 2025-05-07T20:32:43.2566038Z torch.manual_seed(2025) 2025-05-07T20:32:43.2566284Z 2025-05-07T20:32:43.2566569Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2566925Z 2025-05-07T20:32:43.2567136Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2567435Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2569512Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2571396Z 2025-05-07T20:32:43.2571515Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2571730Z 2025-05-07T20:32:43.2571851Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2572263Z self=, 2025-05-07T20:32:43.2572801Z T=16384, 2025-05-07T20:32:43.2573014Z D=7168, 2025-05-07T20:32:43.2573217Z scale_ub=None, 2025-05-07T20:32:43.2573435Z contiguous=False, 2025-05-07T20:32:43.2573674Z compiled=False, 2025-05-07T20:32:43.2574003Z ) 2025-05-07T20:32:43.2574322Z self = 2025-05-07T20:32:43.2574835Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2575116Z 2025-05-07T20:32:43.2575206Z @given( 2025-05-07T20:32:43.2575440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2575766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2576084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2576419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2576758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2577062Z ) 2025-05-07T20:32:43.2577433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2577882Z def test_silu_mul_quant( 2025-05-07T20:32:43.2578137Z self, 2025-05-07T20:32:43.2578340Z T: int, 2025-05-07T20:32:43.2578547Z D: int, 2025-05-07T20:32:43.2578786Z scale_ub: Optional[float], 2025-05-07T20:32:43.2579069Z contiguous: bool, 2025-05-07T20:32:43.2579311Z compiled: bool, 2025-05-07T20:32:43.2579544Z ) -> None: 2025-05-07T20:32:43.2579789Z torch.manual_seed(2025) 2025-05-07T20:32:43.2580039Z 2025-05-07T20:32:43.2580315Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2582389Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
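The 448 MiB figure is exactly the tensor the failing line requests: a [T, 2 * D] bfloat16 buffer with T=16384 and D=7168. A quick check of that arithmetic:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16), bfloat16 = 2 bytes/element
    T, D, BF16_BYTES = 16384, 7168, 2
    mib = T * (2 * D) * BF16_BYTES / 2**20
    print(mib)  # 448.0, matching "Tried to allocate 448.00 MiB"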
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2584275Z 2025-05-07T20:32:43.2584396Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2584609Z 2025-05-07T20:32:43.2584720Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2585134Z self=, 2025-05-07T20:32:43.2585544Z T=2048, 2025-05-07T20:32:43.2585738Z D=7168, 2025-05-07T20:32:43.2585938Z scale_ub=1200.0, 2025-05-07T20:32:43.2586158Z contiguous=True, 2025-05-07T20:32:43.2586390Z compiled=True, 2025-05-07T20:32:43.2586600Z ) 2025-05-07T20:32:43.2586921Z self = 2025-05-07T20:32:43.2587417Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2587694Z 2025-05-07T20:32:43.2587789Z @given( 2025-05-07T20:32:43.2588022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2588347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2588669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2588996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2589335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2589628Z ) 2025-05-07T20:32:43.2589975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2590427Z def test_silu_mul_quant( 2025-05-07T20:32:43.2590678Z self, 2025-05-07T20:32:43.2590880Z T: int, 2025-05-07T20:32:43.2591077Z D: int, 2025-05-07T20:32:43.2591300Z scale_ub: Optional[float], 2025-05-07T20:32:43.2591576Z contiguous: bool, 2025-05-07T20:32:43.2591816Z compiled: bool, 2025-05-07T20:32:43.2592044Z ) -> None: 2025-05-07T20:32:43.2592354Z torch.manual_seed(2025) 2025-05-07T20:32:43.2592597Z 2025-05-07T20:32:43.2592871Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2593216Z 2025-05-07T20:32:43.2593486Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2593778Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2595831Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2597683Z 2025-05-07T20:32:43.2597817Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2598030Z 2025-05-07T20:32:43.2598141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2598556Z self=, 2025-05-07T20:32:43.2598973Z T=2048, 2025-05-07T20:32:43.2599169Z D=7168, 2025-05-07T20:32:43.2599359Z scale_ub=None, 2025-05-07T20:32:43.2599577Z contiguous=True, 2025-05-07T20:32:43.2599803Z compiled=False, 2025-05-07T20:32:43.2600006Z ) 2025-05-07T20:32:43.3727341Z self = 2025-05-07T20:32:43.3728110Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.3728484Z 2025-05-07T20:32:43.3728600Z @given( 2025-05-07T20:32:43.3728830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3729150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3729464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3729832Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3730161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3730463Z ) 2025-05-07T20:32:43.3730819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3731279Z def test_silu_mul_quant( 2025-05-07T20:32:43.3731533Z self, 2025-05-07T20:32:43.3731738Z T: int, 2025-05-07T20:32:43.3731937Z D: int, 2025-05-07T20:32:43.3732160Z scale_ub: Optional[float], 2025-05-07T20:32:43.3732437Z contiguous: bool, 2025-05-07T20:32:43.3732678Z compiled: bool, 2025-05-07T20:32:43.3732910Z ) -> None: 2025-05-07T20:32:43.3733136Z torch.manual_seed(2025) 2025-05-07T20:32:43.3733381Z 2025-05-07T20:32:43.3733659Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3734012Z 2025-05-07T20:32:43.3734219Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.3736189Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
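Note that the "allocated by PyTorch" figure creeps upward across consecutive examples (21.60 GiB at the first OOM above, 21.67 GiB here), which suggests tensors from earlier hypothesis examples stay live while the next one runs. A hypothetical per-example cleanup helper (the name is not from this codebase) that a test could call between examples:

    import gc

    import torch

    def release_cuda_memory() -> None:
        """Best-effort memory reclaim between test examples."""
        gc.collect()              # drop tensors kept alive only by dead frames
        torch.cuda.empty_cache()  # hand cached allocator blocks back to the driver
        torch.cuda.synchronize()  # ensure frees complete before the next case

empty_cache() alone cannot release blocks that are still referenced, so the gc.collect() pass matters when exception tracebacks pin old frames.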
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.3738067Z 2025-05-07T20:32:43.3738185Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.3738404Z 2025-05-07T20:32:43.3738512Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3738985Z self=, 2025-05-07T20:32:43.3739385Z T=1, 2025-05-07T20:32:43.3739577Z D=7168, 2025-05-07T20:32:43.3739773Z scale_ub=1200.0, 2025-05-07T20:32:43.3740300Z contiguous=True, 2025-05-07T20:32:43.3740523Z compiled=False, 2025-05-07T20:32:43.3740731Z ) 2025-05-07T20:32:43.3741054Z self = 2025-05-07T20:32:43.3741700Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.3741970Z 2025-05-07T20:32:43.3742050Z @given( 2025-05-07T20:32:43.3742287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3742597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3742907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3743240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3743565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3743858Z ) 2025-05-07T20:32:43.3744209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3744653Z def test_silu_mul_quant( 2025-05-07T20:32:43.3744897Z self, 2025-05-07T20:32:43.3745103Z T: int, 2025-05-07T20:32:43.3745307Z D: int, 2025-05-07T20:32:43.3745526Z scale_ub: Optional[float], 2025-05-07T20:32:43.3745800Z contiguous: bool, 2025-05-07T20:32:43.3746049Z compiled: bool, 2025-05-07T20:32:43.3746269Z ) -> None: 2025-05-07T20:32:43.3746487Z torch.manual_seed(2025) 2025-05-07T20:32:43.3746732Z 2025-05-07T20:32:43.3747001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3747346Z 2025-05-07T20:32:43.3747546Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3747837Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3748155Z x = x_sign * x_clamp 2025-05-07T20:32:43.3748405Z x0 = x[:, :D] 2025-05-07T20:32:43.3748622Z x1 = x[:, D:] 2025-05-07T20:32:43.3748838Z 2025-05-07T20:32:43.3749032Z if contiguous: 2025-05-07T20:32:43.3749273Z x0 = x0.contiguous() 2025-05-07T20:32:43.3749545Z x1 = x1.contiguous() 2025-05-07T20:32:43.3749792Z 2025-05-07T20:32:43.3750000Z if scale_ub is not None: 2025-05-07T20:32:43.3750277Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.3750623Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.3750944Z ) 2025-05-07T20:32:43.3751141Z else: 2025-05-07T20:32:43.3751360Z scale_ub_tensor = None 2025-05-07T20:32:43.3751618Z 2025-05-07T20:32:43.3751857Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.3752179Z op = silu_mul_quant 2025-05-07T20:32:43.3752433Z if compiled: 2025-05-07T20:32:43.3752679Z op = torch.compile(op) 2025-05-07T20:32:43.3752978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3753256Z 2025-05-07T20:32:43.3753449Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.3753618Z 2025-05-07T20:32:43.3753720Z moe/activation_test.py:117: 2025-05-07T20:32:43.3754025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3754362Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.3754648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3755354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.3756150Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.3756685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.3757372Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.3758041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.3758582Z kernel = self.compile( 2025-05-07T20:32:43.3759207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.3759873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.3760280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3760593Z 2025-05-07T20:32:43.3760809Z self = 2025-05-07T20:32:43.3761890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.3763263Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b260f40>} 2025-05-07T20:32:43.3764614Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.3765926Z context = 2025-05-07T20:32:43.3766342Z 2025-05-07T20:32:43.3766519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.3767050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.3767528Z module_map=module_map) 2025-05-07T20:32:43.3767895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.3768252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.3768521Z E ^ 2025-05-07T20:32:43.3768995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.3769449Z 2025-05-07T20:32:43.3769869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.3770395Z 2025-05-07T20:32:43.3770501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3770923Z self=, 2025-05-07T20:32:43.3771340Z T=128, 2025-05-07T20:32:43.3771532Z D=5120, 2025-05-07T20:32:43.3771734Z scale_ub=None, 2025-05-07T20:32:43.3771952Z contiguous=True, 2025-05-07T20:32:43.3772179Z compiled=False, 2025-05-07T20:32:43.3772393Z ) 2025-05-07T20:32:43.4451491Z self = 2025-05-07T20:32:43.4452202Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.4452578Z 2025-05-07T20:32:43.4452685Z @given( 2025-05-07T20:32:43.4452990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.4453405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.4453799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.4454212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.4454549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.4454834Z ) 2025-05-07T20:32:43.4455192Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.4455648Z def test_silu_mul_quant( 2025-05-07T20:32:43.4455891Z self, 2025-05-07T20:32:43.4456083Z T: int, 2025-05-07T20:32:43.4456285Z D: int, 2025-05-07T20:32:43.4456509Z scale_ub: Optional[float], 2025-05-07T20:32:43.4456781Z contiguous: bool, 2025-05-07T20:32:43.4457025Z compiled: bool, 2025-05-07T20:32:43.4457253Z ) -> None: 2025-05-07T20:32:43.4457468Z torch.manual_seed(2025) 2025-05-07T20:32:43.4457717Z 2025-05-07T20:32:43.4457991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.4458336Z 2025-05-07T20:32:43.4458532Z x_sign = torch.sign(x) 2025-05-07T20:32:43.4458852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.4459480Z x = x_sign * x_clamp 2025-05-07T20:32:43.4459731Z x0 = x[:, :D] 2025-05-07T20:32:43.4459943Z x1 = x[:, D:] 2025-05-07T20:32:43.4460160Z 2025-05-07T20:32:43.4460487Z if contiguous: 2025-05-07T20:32:43.4460720Z x0 = x0.contiguous() 2025-05-07T20:32:43.4460991Z x1 = x1.contiguous() 2025-05-07T20:32:43.4461238Z 2025-05-07T20:32:43.4461428Z if scale_ub is not None: 2025-05-07T20:32:43.4461710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.4462052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.4462357Z ) 2025-05-07T20:32:43.4462556Z else: 2025-05-07T20:32:43.4462770Z scale_ub_tensor = None 2025-05-07T20:32:43.4463021Z 2025-05-07T20:32:43.4463256Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4463574Z op = silu_mul_quant 2025-05-07T20:32:43.4463826Z if compiled: 2025-05-07T20:32:43.4464092Z op = torch.compile(op) 2025-05-07T20:32:43.4464390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4464668Z 2025-05-07T20:32:43.4464860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.4465039Z 2025-05-07T20:32:43.4465141Z moe/activation_test.py:117: 2025-05-07T20:32:43.4465742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4466081Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.4466365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4467055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.4467737Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.4468278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.4468964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.4469631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.4470163Z kernel = self.compile( 2025-05-07T20:32:43.4470710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4471369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4471768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4472001Z 2025-05-07T20:32:43.4472210Z self = 2025-05-07T20:32:43.4473296Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4474693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262020>} 2025-05-07T20:32:43.4476101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4477130Z context = 2025-05-07T20:32:43.4477425Z 2025-05-07T20:32:43.4477593Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4478126Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4478610Z module_map=module_map) 2025-05-07T20:32:43.4478977Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4479347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4479620Z E ^ 2025-05-07T20:32:43.4480214Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4480673Z 2025-05-07T20:32:43.4481087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4481714Z 2025-05-07T20:32:43.4481823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.4482242Z self=, 2025-05-07T20:32:43.4482650Z T=128, 2025-05-07T20:32:43.4482853Z D=7168, 2025-05-07T20:32:43.4483066Z scale_ub=None, 2025-05-07T20:32:43.4483279Z contiguous=True, 2025-05-07T20:32:43.4483509Z compiled=False, 2025-05-07T20:32:43.4483723Z ) 2025-05-07T20:32:43.4484039Z self = 2025-05-07T20:32:43.4484529Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.4484807Z 2025-05-07T20:32:43.4484892Z @given( 2025-05-07T20:32:43.4485124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.4485434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.4485752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.4486088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.4486411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.4486703Z ) 2025-05-07T20:32:43.4487055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.4487496Z def test_silu_mul_quant( 2025-05-07T20:32:43.4487745Z self, 2025-05-07T20:32:43.4487949Z T: int, 2025-05-07T20:32:43.4488151Z D: int, 2025-05-07T20:32:43.4488367Z scale_ub: Optional[float], 2025-05-07T20:32:43.4488645Z contiguous: bool, 2025-05-07T20:32:43.4488887Z compiled: bool, 2025-05-07T20:32:43.4489121Z ) -> None: 2025-05-07T20:32:43.4489352Z torch.manual_seed(2025) 2025-05-07T20:32:43.4489600Z 2025-05-07T20:32:43.4489875Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.4490233Z 2025-05-07T20:32:43.4490449Z x_sign = torch.sign(x) 2025-05-07T20:32:43.4490742Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.4499582Z x = x_sign * x_clamp 2025-05-07T20:32:43.4499848Z x0 = x[:, :D] 2025-05-07T20:32:43.4500079Z x1 = x[:, D:] 2025-05-07T20:32:43.4500293Z 2025-05-07T20:32:43.4500492Z if contiguous: 2025-05-07T20:32:43.4500734Z x0 = x0.contiguous() 2025-05-07T20:32:43.4500996Z x1 = x1.contiguous() 2025-05-07T20:32:43.4501243Z 2025-05-07T20:32:43.4501444Z if scale_ub is not None: 2025-05-07T20:32:43.4501728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.4502078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.4502391Z ) 2025-05-07T20:32:43.4502609Z else: 2025-05-07T20:32:43.4502833Z scale_ub_tensor = None 2025-05-07T20:32:43.4503079Z 2025-05-07T20:32:43.4503315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4503648Z op = silu_mul_quant 2025-05-07T20:32:43.4503911Z if compiled: 2025-05-07T20:32:43.4504171Z op = torch.compile(op) 2025-05-07T20:32:43.4504477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4504759Z 2025-05-07T20:32:43.4504968Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.4505136Z 2025-05-07T20:32:43.4505250Z moe/activation_test.py:117: 2025-05-07T20:32:43.4505549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4505899Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.4506192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4507011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.4507708Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.4508254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.4509021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.4509686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.4510228Z kernel = self.compile( 2025-05-07T20:32:43.4510776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4511441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4511842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4512083Z 2025-05-07T20:32:43.4512299Z self = 2025-05-07T20:32:43.4513390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4514780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262f20>} 2025-05-07T20:32:43.4516222Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4517249Z context = 2025-05-07T20:32:43.4517549Z 2025-05-07T20:32:43.4517718Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4518252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4518725Z module_map=module_map) 2025-05-07T20:32:43.4519095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4519466Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4519733Z E ^ 2025-05-07T20:32:43.4520198Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4520662Z 2025-05-07T20:32:43.4521078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4521588Z 2025-05-07T20:32:43.4521701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.4522121Z self=, 2025-05-07T20:32:43.4522525Z T=2048, 2025-05-07T20:32:43.4522729Z D=7168, 2025-05-07T20:32:43.4522932Z scale_ub=1200.0, 2025-05-07T20:32:43.4523162Z contiguous=True, 2025-05-07T20:32:43.4523397Z compiled=False, 2025-05-07T20:32:43.4523612Z ) 2025-05-07T20:32:43.5352512Z self = 2025-05-07T20:32:43.5353250Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.5353661Z 2025-05-07T20:32:43.5353785Z @given( 2025-05-07T20:32:43.5354105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5354536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5354862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5355198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5355527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5355894Z ) 2025-05-07T20:32:43.5356249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5356693Z def test_silu_mul_quant( 2025-05-07T20:32:43.5357169Z self, 2025-05-07T20:32:43.5357377Z T: int, 2025-05-07T20:32:43.5357578Z D: int, 2025-05-07T20:32:43.5357806Z scale_ub: Optional[float], 2025-05-07T20:32:43.5358123Z contiguous: bool, 2025-05-07T20:32:43.5358506Z compiled: bool, 2025-05-07T20:32:43.5358744Z ) -> None: 2025-05-07T20:32:43.5358992Z torch.manual_seed(2025) 2025-05-07T20:32:43.5359271Z 2025-05-07T20:32:43.5359549Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5361632Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.5363490Z 2025-05-07T20:32:43.5363620Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.5363840Z 2025-05-07T20:32:43.5363948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5364371Z self=, 2025-05-07T20:32:43.5364788Z T=1, 2025-05-07T20:32:43.5364974Z D=5120, 2025-05-07T20:32:43.5365173Z scale_ub=1200.0, 2025-05-07T20:32:43.5365676Z contiguous=True, 2025-05-07T20:32:43.5365907Z compiled=False, 2025-05-07T20:32:43.5366120Z ) 2025-05-07T20:32:43.5366446Z self = 2025-05-07T20:32:43.5366928Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.5367198Z 2025-05-07T20:32:43.5367276Z @given( 2025-05-07T20:32:43.5367511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5367838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5368147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5368480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5368815Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5369100Z ) 2025-05-07T20:32:43.5369456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5369901Z def test_silu_mul_quant( 2025-05-07T20:32:43.5370146Z self, 2025-05-07T20:32:43.5370352Z T: int, 2025-05-07T20:32:43.5370559Z D: int, 2025-05-07T20:32:43.5370782Z scale_ub: Optional[float], 2025-05-07T20:32:43.5371065Z contiguous: bool, 2025-05-07T20:32:43.5371310Z compiled: bool, 2025-05-07T20:32:43.5371546Z ) -> None: 2025-05-07T20:32:43.5371762Z torch.manual_seed(2025) 2025-05-07T20:32:43.5372011Z 2025-05-07T20:32:43.5372299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5372641Z 2025-05-07T20:32:43.5372848Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5373148Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5373460Z x = x_sign * x_clamp 2025-05-07T20:32:43.5373709Z x0 = x[:, :D] 2025-05-07T20:32:43.5373936Z x1 = x[:, D:] 2025-05-07T20:32:43.5374150Z 2025-05-07T20:32:43.5374351Z if contiguous: 2025-05-07T20:32:43.5374590Z x0 = x0.contiguous() 2025-05-07T20:32:43.5374850Z x1 = x1.contiguous() 2025-05-07T20:32:43.5375105Z 2025-05-07T20:32:43.5375317Z if scale_ub is not None: 2025-05-07T20:32:43.5375592Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5375932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5376247Z ) 2025-05-07T20:32:43.5376452Z else: 2025-05-07T20:32:43.5376677Z scale_ub_tensor = None 2025-05-07T20:32:43.5376929Z 2025-05-07T20:32:43.5377298Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5377622Z op = silu_mul_quant 2025-05-07T20:32:43.5377882Z if compiled: 2025-05-07T20:32:43.5378235Z op = torch.compile(op) 2025-05-07T20:32:43.5378540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5378822Z 2025-05-07T20:32:43.5379016Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5379187Z 2025-05-07T20:32:43.5379286Z moe/activation_test.py:117: 2025-05-07T20:32:43.5379584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5379918Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5380204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5380901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5381589Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5382131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5382812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5383482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5384015Z kernel = self.compile( 2025-05-07T20:32:43.5384563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5385217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5385617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5385847Z 2025-05-07T20:32:43.5386054Z self = 2025-05-07T20:32:43.5387140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5388510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b1004a0>} 2025-05-07T20:32:43.5389911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5390934Z context = 2025-05-07T20:32:43.5391222Z 2025-05-07T20:32:43.5391387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5391915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5392395Z module_map=module_map) 2025-05-07T20:32:43.5392755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5393111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5393380Z E ^ 2025-05-07T20:32:43.5393851Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5394303Z 2025-05-07T20:32:43.5394713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5395226Z 2025-05-07T20:32:43.5395331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5395790Z self=, 2025-05-07T20:32:43.5396194Z T=2048, 2025-05-07T20:32:43.5396394Z D=5120, 2025-05-07T20:32:43.5396591Z scale_ub=None, 2025-05-07T20:32:43.5396803Z contiguous=True, 2025-05-07T20:32:43.5397037Z compiled=False, 2025-05-07T20:32:43.5397248Z ) 2025-05-07T20:32:43.5397656Z self = 2025-05-07T20:32:43.5398154Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.5398590Z 2025-05-07T20:32:43.5398672Z @given( 2025-05-07T20:32:43.5398911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5399220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5399537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5399870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5400204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5400499Z ) 2025-05-07T20:32:43.5400853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5401306Z def test_silu_mul_quant( 2025-05-07T20:32:43.5401549Z self, 2025-05-07T20:32:43.5401750Z T: int, 2025-05-07T20:32:43.5401958Z D: int, 2025-05-07T20:32:43.5402181Z scale_ub: Optional[float], 2025-05-07T20:32:43.5402458Z contiguous: bool, 2025-05-07T20:32:43.5402703Z compiled: bool, 2025-05-07T20:32:43.5402926Z ) -> None: 2025-05-07T20:32:43.5403157Z torch.manual_seed(2025) 2025-05-07T20:32:43.5403406Z 2025-05-07T20:32:43.5403675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5404021Z 2025-05-07T20:32:43.5404225Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.5406184Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
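The other recurring failure above is deterministic: Triton rejects fp8e4nv (PyTorch's float8_e4m3fn) at kernel compile time on this GPU and lists only fp8e4b15 and fp8e5 as available. That is consistent with fp8e4nv generally requiring compute capability 8.9 (Ada) or newer. A hedged sketch of a skip guard a test module could use; the decorator name is hypothetical, and the (8, 9) threshold is an assumption matching the error text:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumes fp8e4nv needs SM 8.9+ (Ada/Hopper), consistent with the
        # CompilationError in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv unsupported on this GPU architecture"
    )

Guarded this way, these examples would report as skips on pre-Ada runners instead of tripping Triton's frontend on every draw.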
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.5408040Z 2025-05-07T20:32:43.5408165Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.5408376Z 2025-05-07T20:32:43.5408484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5408901Z self=, 2025-05-07T20:32:43.5409305Z T=16384, 2025-05-07T20:32:43.5409500Z D=5120, 2025-05-07T20:32:43.5409698Z scale_ub=None, 2025-05-07T20:32:43.5409914Z contiguous=True, 2025-05-07T20:32:43.5410134Z compiled=False, 2025-05-07T20:32:43.5410342Z ) 2025-05-07T20:32:43.6174082Z self = 2025-05-07T20:32:43.6174604Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.6174924Z 2025-05-07T20:32:43.6175034Z @given( 2025-05-07T20:32:43.6175364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6175833Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6176252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6176706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6177105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6177392Z ) 2025-05-07T20:32:43.6177745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6178192Z def test_silu_mul_quant( 2025-05-07T20:32:43.6178434Z self, 2025-05-07T20:32:43.6178640Z T: int, 2025-05-07T20:32:43.6178841Z D: int, 2025-05-07T20:32:43.6179085Z scale_ub: Optional[float], 2025-05-07T20:32:43.6179387Z contiguous: bool, 2025-05-07T20:32:43.6179633Z compiled: bool, 2025-05-07T20:32:43.6179868Z ) -> None: 2025-05-07T20:32:43.6180083Z torch.manual_seed(2025) 2025-05-07T20:32:43.6180329Z 2025-05-07T20:32:43.6180610Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6182837Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6184820Z 2025-05-07T20:32:43.6184941Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6185160Z 2025-05-07T20:32:43.6185266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6185694Z self=, 2025-05-07T20:32:43.6186106Z T=4096, 2025-05-07T20:32:43.6186300Z D=5120, 2025-05-07T20:32:43.6186510Z scale_ub=None, 2025-05-07T20:32:43.6186729Z contiguous=True, 2025-05-07T20:32:43.6186954Z compiled=False, 2025-05-07T20:32:43.6187166Z ) 2025-05-07T20:32:43.6187489Z self = 2025-05-07T20:32:43.6187988Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.6188270Z 2025-05-07T20:32:43.6188352Z @given( 2025-05-07T20:32:43.6188589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6188903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6189222Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6189562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6189903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6190196Z ) 2025-05-07T20:32:43.6190547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6190993Z def test_silu_mul_quant( 2025-05-07T20:32:43.6191239Z self, 2025-05-07T20:32:43.6191438Z T: int, 2025-05-07T20:32:43.6191640Z D: int, 2025-05-07T20:32:43.6191860Z scale_ub: Optional[float], 2025-05-07T20:32:43.6192141Z contiguous: bool, 2025-05-07T20:32:43.6192386Z compiled: bool, 2025-05-07T20:32:43.6192609Z ) -> None: 2025-05-07T20:32:43.6192834Z torch.manual_seed(2025) 2025-05-07T20:32:43.6193081Z 2025-05-07T20:32:43.6193355Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6195406Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6197351Z 2025-05-07T20:32:43.6197470Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6197692Z 2025-05-07T20:32:43.6197796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6198216Z self=, 2025-05-07T20:32:43.6198621Z T=2048, 2025-05-07T20:32:43.6198817Z D=5120, 2025-05-07T20:32:43.6199019Z scale_ub=None, 2025-05-07T20:32:43.6199232Z contiguous=False, 2025-05-07T20:32:43.6199460Z compiled=False, 2025-05-07T20:32:43.6199668Z ) 2025-05-07T20:32:43.6199991Z self = 2025-05-07T20:32:43.6200494Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.6200797Z 2025-05-07T20:32:43.6200878Z @given( 2025-05-07T20:32:43.6201203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6201524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6201829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6202236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6202572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6202857Z ) 2025-05-07T20:32:43.6203218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6203673Z def test_silu_mul_quant( 2025-05-07T20:32:43.6203915Z self, 2025-05-07T20:32:43.6204118Z T: int, 2025-05-07T20:32:43.6204325Z D: int, 2025-05-07T20:32:43.6204547Z scale_ub: Optional[float], 2025-05-07T20:32:43.6204825Z contiguous: bool, 2025-05-07T20:32:43.6205077Z compiled: bool, 2025-05-07T20:32:43.6205307Z ) -> None: 2025-05-07T20:32:43.6205525Z torch.manual_seed(2025) 2025-05-07T20:32:43.6205784Z 2025-05-07T20:32:43.6206084Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6208137Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6209996Z 2025-05-07T20:32:43.6210129Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6210345Z 2025-05-07T20:32:43.6210456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6210885Z self=, 2025-05-07T20:32:43.6211299Z T=4096, 2025-05-07T20:32:43.6211499Z D=7168, 2025-05-07T20:32:43.6211700Z scale_ub=None, 2025-05-07T20:32:43.6211923Z contiguous=True, 2025-05-07T20:32:43.6212146Z compiled=True, 2025-05-07T20:32:43.6212366Z ) 2025-05-07T20:32:43.6212691Z self = 2025-05-07T20:32:43.6213178Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.6213454Z 2025-05-07T20:32:43.6213535Z @given( 2025-05-07T20:32:43.6213773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6214099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6214407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6214747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6215082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6215373Z ) 2025-05-07T20:32:43.6215739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6216191Z def test_silu_mul_quant( 2025-05-07T20:32:43.6216434Z self, 2025-05-07T20:32:43.6216633Z T: int, 2025-05-07T20:32:43.6216833Z D: int, 2025-05-07T20:32:43.6217055Z scale_ub: Optional[float], 2025-05-07T20:32:43.6217336Z contiguous: bool, 2025-05-07T20:32:43.6217585Z compiled: bool, 2025-05-07T20:32:43.6217810Z ) -> None: 2025-05-07T20:32:43.6218035Z torch.manual_seed(2025) 2025-05-07T20:32:43.6218281Z 2025-05-07T20:32:43.6218558Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6220695Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6222641Z 2025-05-07T20:32:43.6222764Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6222988Z 2025-05-07T20:32:43.6223098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6223523Z self=, 2025-05-07T20:32:43.6223944Z T=2048, 2025-05-07T20:32:43.6224137Z D=5120, 2025-05-07T20:32:43.6224341Z scale_ub=1200.0, 2025-05-07T20:32:43.6224579Z contiguous=False, 2025-05-07T20:32:43.6224812Z compiled=False, 2025-05-07T20:32:43.6225031Z ) 2025-05-07T20:32:43.6225361Z self = 2025-05-07T20:32:43.6225859Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.6226149Z 2025-05-07T20:32:43.6226238Z @given( 2025-05-07T20:32:43.6226483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6226803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6227130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6227486Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6227829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6228125Z ) 2025-05-07T20:32:43.6228492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6228958Z def test_silu_mul_quant( 2025-05-07T20:32:43.6229209Z self, 2025-05-07T20:32:43.6229425Z T: int, 2025-05-07T20:32:43.6229634Z D: int, 2025-05-07T20:32:43.6229857Z scale_ub: Optional[float], 2025-05-07T20:32:43.6230154Z contiguous: bool, 2025-05-07T20:32:43.6230417Z compiled: bool, 2025-05-07T20:32:43.6230646Z ) -> None: 2025-05-07T20:32:43.6230888Z torch.manual_seed(2025) 2025-05-07T20:32:43.6231149Z 2025-05-07T20:32:43.6231427Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6233489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.6235361Z 2025-05-07T20:32:43.6235482Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.6235756Z 2025-05-07T20:32:43.6235866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6236294Z self=, 2025-05-07T20:32:43.6236706Z T=4096, 2025-05-07T20:32:43.6236913Z D=7168, 2025-05-07T20:32:43.6237127Z scale_ub=1200.0, 2025-05-07T20:32:43.6237351Z contiguous=True, 2025-05-07T20:32:43.6237611Z compiled=False, 2025-05-07T20:32:43.6237830Z ) 2025-05-07T20:32:43.7294688Z self = 2025-05-07T20:32:43.7295244Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.7295533Z 2025-05-07T20:32:43.7295618Z @given( 2025-05-07T20:32:43.7295857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7296179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7296501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7305060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7305418Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7305717Z ) 2025-05-07T20:32:43.7306251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7306711Z def test_silu_mul_quant( 2025-05-07T20:32:43.7306957Z self, 2025-05-07T20:32:43.7307276Z T: int, 2025-05-07T20:32:43.7307476Z D: int, 2025-05-07T20:32:43.7307701Z scale_ub: Optional[float], 2025-05-07T20:32:43.7307980Z contiguous: bool, 2025-05-07T20:32:43.7308222Z compiled: bool, 2025-05-07T20:32:43.7308460Z ) -> None: 2025-05-07T20:32:43.7308684Z torch.manual_seed(2025) 2025-05-07T20:32:43.7308952Z 2025-05-07T20:32:43.7309260Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7311334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7313210Z 2025-05-07T20:32:43.7313340Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7313587Z 2025-05-07T20:32:43.7313700Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7314128Z self=, 2025-05-07T20:32:43.7314535Z T=16384, 2025-05-07T20:32:43.7314739Z D=7168, 2025-05-07T20:32:43.7314946Z scale_ub=None, 2025-05-07T20:32:43.7315165Z contiguous=False, 2025-05-07T20:32:43.7315401Z compiled=True, 2025-05-07T20:32:43.7315615Z ) 2025-05-07T20:32:43.7316005Z self = 2025-05-07T20:32:43.7316509Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.7316799Z 2025-05-07T20:32:43.7316886Z @given( 2025-05-07T20:32:43.7317132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7317452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7317772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7318112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7318444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7318738Z ) 2025-05-07T20:32:43.7319092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7319552Z def test_silu_mul_quant( 2025-05-07T20:32:43.7319803Z self, 2025-05-07T20:32:43.7320006Z T: int, 2025-05-07T20:32:43.7320218Z D: int, 2025-05-07T20:32:43.7320442Z scale_ub: Optional[float], 2025-05-07T20:32:43.7320720Z contiguous: bool, 2025-05-07T20:32:43.7320972Z compiled: bool, 2025-05-07T20:32:43.7321203Z ) -> None: 2025-05-07T20:32:43.7321430Z torch.manual_seed(2025) 2025-05-07T20:32:43.7321681Z 2025-05-07T20:32:43.7321951Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7324017Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7325886Z 2025-05-07T20:32:43.7326006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7326224Z 2025-05-07T20:32:43.7326416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7326839Z self=, 2025-05-07T20:32:43.7327240Z T=4096, 2025-05-07T20:32:43.7327440Z D=7168, 2025-05-07T20:32:43.7327714Z scale_ub=None, 2025-05-07T20:32:43.7327930Z contiguous=True, 2025-05-07T20:32:43.7328158Z compiled=False, 2025-05-07T20:32:43.7328374Z ) 2025-05-07T20:32:43.7328691Z self = 2025-05-07T20:32:43.7329192Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.7329462Z 2025-05-07T20:32:43.7329546Z @given( 2025-05-07T20:32:43.7329787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7330111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7330424Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7330767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7331102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7331401Z ) 2025-05-07T20:32:43.7331760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7332211Z def test_silu_mul_quant( 2025-05-07T20:32:43.7332460Z self, 2025-05-07T20:32:43.7332660Z T: int, 2025-05-07T20:32:43.7332861Z D: int, 2025-05-07T20:32:43.7333087Z scale_ub: Optional[float], 2025-05-07T20:32:43.7333364Z contiguous: bool, 2025-05-07T20:32:43.7333611Z compiled: bool, 2025-05-07T20:32:43.7333847Z ) -> None: 2025-05-07T20:32:43.7334067Z torch.manual_seed(2025) 2025-05-07T20:32:43.7334315Z 2025-05-07T20:32:43.7334593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7336655Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7338530Z 2025-05-07T20:32:43.7338653Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7338869Z 2025-05-07T20:32:43.7338986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7339400Z self=, 2025-05-07T20:32:43.7339808Z T=16384, 2025-05-07T20:32:43.7340011Z D=7168, 2025-05-07T20:32:43.7340202Z scale_ub=None, 2025-05-07T20:32:43.7340423Z contiguous=True, 2025-05-07T20:32:43.7340652Z compiled=False, 2025-05-07T20:32:43.7340860Z ) 2025-05-07T20:32:43.7341181Z self = 2025-05-07T20:32:43.7341688Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.7341966Z 2025-05-07T20:32:43.7342051Z @given( 2025-05-07T20:32:43.7342283Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7342604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7342921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7343251Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7343586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7343878Z ) 2025-05-07T20:32:43.7344232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7344676Z def test_silu_mul_quant( 2025-05-07T20:32:43.7344921Z self, 2025-05-07T20:32:43.7345121Z T: int, 2025-05-07T20:32:43.7345317Z D: int, 2025-05-07T20:32:43.7345540Z scale_ub: Optional[float], 2025-05-07T20:32:43.7345905Z contiguous: bool, 2025-05-07T20:32:43.7346151Z compiled: bool, 2025-05-07T20:32:43.7346386Z ) -> None: 2025-05-07T20:32:43.7346608Z torch.manual_seed(2025) 2025-05-07T20:32:43.7346852Z 2025-05-07T20:32:43.7347203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7349318Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7351191Z 2025-05-07T20:32:43.7351316Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7351544Z 2025-05-07T20:32:43.7351659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7352081Z self=, 2025-05-07T20:32:43.7352497Z T=16384, 2025-05-07T20:32:43.7352707Z D=7168, 2025-05-07T20:32:43.7352904Z scale_ub=1200.0, 2025-05-07T20:32:43.7353135Z contiguous=True, 2025-05-07T20:32:43.7353371Z compiled=False, 2025-05-07T20:32:43.7353581Z ) 2025-05-07T20:32:43.7353910Z self = 2025-05-07T20:32:43.7354423Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.7354704Z 2025-05-07T20:32:43.7354795Z @given( 2025-05-07T20:32:43.7355031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7355362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7355675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7356083Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7356428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7356713Z ) 2025-05-07T20:32:43.7357063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7357523Z def test_silu_mul_quant( 2025-05-07T20:32:43.7357769Z self, 2025-05-07T20:32:43.7357969Z T: int, 2025-05-07T20:32:43.7358172Z D: int, 2025-05-07T20:32:43.7358390Z scale_ub: Optional[float], 2025-05-07T20:32:43.7358664Z contiguous: bool, 2025-05-07T20:32:43.7358920Z compiled: bool, 2025-05-07T20:32:43.7359168Z ) -> None: 2025-05-07T20:32:43.7359413Z torch.manual_seed(2025) 2025-05-07T20:32:43.7359666Z 2025-05-07T20:32:43.7359935Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7361998Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.7363867Z 2025-05-07T20:32:43.7363991Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.7364213Z 2025-05-07T20:32:43.7364319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7364737Z self=, 2025-05-07T20:32:43.7365137Z T=128, 2025-05-07T20:32:43.7365335Z D=5120, 2025-05-07T20:32:43.7365816Z scale_ub=1200.0, 2025-05-07T20:32:43.7366040Z contiguous=False, 2025-05-07T20:32:43.7366270Z compiled=False, 2025-05-07T20:32:43.7366481Z ) 2025-05-07T20:32:43.8632236Z self = 2025-05-07T20:32:43.8632788Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.8633206Z 2025-05-07T20:32:43.8633285Z @given( 2025-05-07T20:32:43.8633517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8633831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8634133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8634462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8634792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8635074Z ) 2025-05-07T20:32:43.8635422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8635937Z def test_silu_mul_quant( 2025-05-07T20:32:43.8636182Z self, 2025-05-07T20:32:43.8636375Z T: int, 2025-05-07T20:32:43.8636577Z D: int, 2025-05-07T20:32:43.8636802Z scale_ub: Optional[float], 2025-05-07T20:32:43.8637071Z contiguous: bool, 2025-05-07T20:32:43.8637313Z compiled: bool, 2025-05-07T20:32:43.8637544Z ) -> None: 2025-05-07T20:32:43.8637767Z torch.manual_seed(2025) 2025-05-07T20:32:43.8638013Z 2025-05-07T20:32:43.8638289Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8638627Z 2025-05-07T20:32:43.8638835Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8639132Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8639435Z x = x_sign * x_clamp 2025-05-07T20:32:43.8639683Z x0 = x[:, :D] 2025-05-07T20:32:43.8639906Z x1 = x[:, D:] 2025-05-07T20:32:43.8640109Z 2025-05-07T20:32:43.8640302Z if contiguous: 2025-05-07T20:32:43.8640537Z x0 = x0.contiguous() 2025-05-07T20:32:43.8640796Z x1 = x1.contiguous() 2025-05-07T20:32:43.8641046Z 2025-05-07T20:32:43.8641250Z if scale_ub is not None: 2025-05-07T20:32:43.8641531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8641861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8642183Z ) 2025-05-07T20:32:43.8642375Z else: 2025-05-07T20:32:43.8642591Z scale_ub_tensor = None 2025-05-07T20:32:43.8642846Z 2025-05-07T20:32:43.8643079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8643393Z op = silu_mul_quant 2025-05-07T20:32:43.8643650Z if compiled: 2025-05-07T20:32:43.8643900Z op = torch.compile(op) 2025-05-07T20:32:43.8644197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8644471Z 2025-05-07T20:32:43.8644667Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8644829Z 2025-05-07T20:32:43.8644930Z moe/activation_test.py:117: 2025-05-07T20:32:43.8645227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8645566Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8645849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8646536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8647228Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8647759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8648434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8649099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8649635Z kernel = self.compile( 2025-05-07T20:32:43.8650175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8650911Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8651317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8651547Z 2025-05-07T20:32:43.8651758Z self = 2025-05-07T20:32:43.8652920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8654289Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b153060>} 2025-05-07T20:32:43.8655629Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8656662Z context = 2025-05-07T20:32:43.8656950Z 2025-05-07T20:32:43.8657119Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8657640Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8658115Z module_map=module_map) 2025-05-07T20:32:43.8658482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8658849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8659130Z E ^ 2025-05-07T20:32:43.8659617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8660065Z 2025-05-07T20:32:43.8660485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8660993Z 2025-05-07T20:32:43.8661104Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8661520Z self=, 2025-05-07T20:32:43.8661926Z T=2048, 2025-05-07T20:32:43.8662115Z D=7168, 2025-05-07T20:32:43.8662306Z scale_ub=None, 2025-05-07T20:32:43.8662530Z contiguous=False, 2025-05-07T20:32:43.8662757Z compiled=False, 2025-05-07T20:32:43.8662960Z ) 2025-05-07T20:32:43.8663276Z self = 2025-05-07T20:32:43.8663778Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.8664050Z 2025-05-07T20:32:43.8664130Z @given( 2025-05-07T20:32:43.8664361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8664677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8664986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8665313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8665820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8666109Z ) 2025-05-07T20:32:43.8666451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8666889Z def test_silu_mul_quant( 2025-05-07T20:32:43.8667136Z self, 2025-05-07T20:32:43.8667327Z T: int, 2025-05-07T20:32:43.8667529Z D: int, 2025-05-07T20:32:43.8667753Z scale_ub: Optional[float], 2025-05-07T20:32:43.8668024Z contiguous: bool, 2025-05-07T20:32:43.8668264Z compiled: bool, 2025-05-07T20:32:43.8668491Z ) -> None: 2025-05-07T20:32:43.8668705Z torch.manual_seed(2025) 2025-05-07T20:32:43.8668949Z 2025-05-07T20:32:43.8669266Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8671452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
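This CompilationError is the run's second, distinct failure mode. Triton's fp8e4nv is the FP8 e4m3 encoding, which Triton lowers only on NVIDIA GPUs with compute capability 8.9 or newer; the A10G on a linux.g5.4xlarge runner reports (8, 6), where only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A sketch of a capability guard; the skip placement is an assumption, not the repository's actual handling:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # FP8 e4m3 ("fp8e4nv" in Triton) generally needs compute capability
        # >= (8, 9), i.e. Ada or Hopper; this runner's A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...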
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.8673411Z 2025-05-07T20:32:43.8673533Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.8673742Z 2025-05-07T20:32:43.8673848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8674259Z self=, 2025-05-07T20:32:43.8674662Z T=128, 2025-05-07T20:32:43.8674846Z D=7168, 2025-05-07T20:32:43.8675038Z scale_ub=1200.0, 2025-05-07T20:32:43.8675261Z contiguous=True, 2025-05-07T20:32:43.8675482Z compiled=True, 2025-05-07T20:32:43.8675683Z ) 2025-05-07T20:32:43.8987867Z self = 2025-05-07T20:32:43.8988393Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.8988690Z 2025-05-07T20:32:43.8988798Z @given( 2025-05-07T20:32:43.8989045Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8989425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8989792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8990184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8990516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8990806Z ) 2025-05-07T20:32:43.8991157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8991607Z def test_silu_mul_quant( 2025-05-07T20:32:43.8991859Z self, 2025-05-07T20:32:43.8992063Z T: int, 2025-05-07T20:32:43.8992259Z D: int, 2025-05-07T20:32:43.8992490Z scale_ub: Optional[float], 2025-05-07T20:32:43.8992771Z contiguous: bool, 2025-05-07T20:32:43.8993021Z compiled: bool, 2025-05-07T20:32:43.8993253Z ) -> None: 2025-05-07T20:32:43.8993473Z torch.manual_seed(2025) 2025-05-07T20:32:43.8993716Z 2025-05-07T20:32:43.8993993Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8994349Z 2025-05-07T20:32:43.8994545Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8994841Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8995160Z x = x_sign * x_clamp 2025-05-07T20:32:43.8995399Z x0 = x[:, :D] 2025-05-07T20:32:43.8995630Z x1 = x[:, D:] 2025-05-07T20:32:43.8995901Z 2025-05-07T20:32:43.8996089Z if contiguous: 2025-05-07T20:32:43.8996329Z x0 = x0.contiguous() 2025-05-07T20:32:43.8996595Z x1 = x1.contiguous() 2025-05-07T20:32:43.8996837Z 2025-05-07T20:32:43.8997032Z if scale_ub is not None: 2025-05-07T20:32:43.8997308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8997653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8997967Z ) 2025-05-07T20:32:43.8998162Z else: 2025-05-07T20:32:43.8998380Z scale_ub_tensor = None 2025-05-07T20:32:43.8998640Z 2025-05-07T20:32:43.8998884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8999207Z op = silu_mul_quant 2025-05-07T20:32:43.8999506Z if compiled: 2025-05-07T20:32:43.8999758Z op = torch.compile(op) 2025-05-07T20:32:43.9000059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.9000334Z 2025-05-07T20:32:43.9000531Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.9000721Z 2025-05-07T20:32:43.9000825Z moe/activation_test.py:117: 2025-05-07T20:32:43.9001130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9001460Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.9001746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.9002480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.9003050Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.9003816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.9004505Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.9005046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.9005725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.9006395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.9006934Z kernel = self.compile( 2025-05-07T20:32:43.9007475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.9008134Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.9008537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.9008773Z 2025-05-07T20:32:43.9008987Z self = 2025-05-07T20:32:43.9010076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.9011452Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8afe0900>} 2025-05-07T20:32:43.9012805Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.9013837Z context = 2025-05-07T20:32:43.9014126Z 2025-05-07T20:32:43.9014298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.9014832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.9015309Z module_map=module_map) 2025-05-07T20:32:43.9015678Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.9016043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.9016349Z E ^ 2025-05-07T20:32:43.9016894Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.9017350Z 2025-05-07T20:32:43.9017772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.9018283Z 2025-05-07T20:32:43.9018395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9018817Z self=, 2025-05-07T20:32:43.9019224Z T=128, 2025-05-07T20:32:43.9019425Z D=7168, 2025-05-07T20:32:43.9019618Z scale_ub=1200.0, 2025-05-07T20:32:43.9019844Z contiguous=True, 2025-05-07T20:32:43.9020071Z compiled=False, 2025-05-07T20:32:43.9020276Z ) 2025-05-07T20:32:43.9020599Z self = 2025-05-07T20:32:43.9021099Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.9021372Z 2025-05-07T20:32:43.9021454Z @given( 2025-05-07T20:32:43.9021689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9022005Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9022312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9022648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9023097Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9023389Z ) 2025-05-07T20:32:43.9023735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9024263Z def test_silu_mul_quant( 2025-05-07T20:32:43.9024509Z self, 2025-05-07T20:32:43.9024705Z T: int, 2025-05-07T20:32:43.9024907Z D: int, 2025-05-07T20:32:43.9025130Z scale_ub: Optional[float], 2025-05-07T20:32:43.9025402Z contiguous: bool, 2025-05-07T20:32:43.9025643Z compiled: bool, 2025-05-07T20:32:43.9025868Z ) -> None: 2025-05-07T20:32:43.9026083Z torch.manual_seed(2025) 2025-05-07T20:32:43.9026328Z 2025-05-07T20:32:43.9026609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9026949Z 2025-05-07T20:32:43.9027154Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9027452Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9029474Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
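Note how the headroom shrinks as the run proceeds: earlier examples reported 30.44 MiB free with 21.73 GiB allocated by PyTorch, while this one reports 8.44 MiB free with 21.77 GiB allocated. Tensors from previous Hypothesis examples are evidently still alive, so even a 20 MiB request fails. A hypothetical per-example cleanup (not the repository's actual fix):

    import gc

    import torch

    def _release_cuda_memory() -> None:
        gc.collect()               # drop dead Python references to CUDA tensors
        torch.cuda.synchronize()   # make sure pending kernels have finished
        torch.cuda.empty_cache()   # return cached, unused blocks to the driver

Called from the test's setUp()/tearDown(), this would give each generated example a clean slate instead of inheriting the previous examples' allocations.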
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9031338Z 2025-05-07T20:32:43.9031465Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9031682Z 2025-05-07T20:32:43.9031788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9032203Z self=, 2025-05-07T20:32:43.9032612Z T=128, 2025-05-07T20:32:43.9032797Z D=5120, 2025-05-07T20:32:43.9032991Z scale_ub=1200.0, 2025-05-07T20:32:43.9033221Z contiguous=True, 2025-05-07T20:32:43.9040574Z compiled=True, 2025-05-07T20:32:43.9040824Z ) 2025-05-07T20:32:43.9041153Z self = 2025-05-07T20:32:43.9041658Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9041930Z 2025-05-07T20:32:43.9042013Z @given( 2025-05-07T20:32:43.9042247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9042568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9042877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9043207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9043539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9043827Z ) 2025-05-07T20:32:43.9044173Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9044616Z def test_silu_mul_quant( 2025-05-07T20:32:43.9044865Z self, 2025-05-07T20:32:43.9045062Z T: int, 2025-05-07T20:32:43.9045264Z D: int, 2025-05-07T20:32:43.9045486Z scale_ub: Optional[float], 2025-05-07T20:32:43.9045757Z contiguous: bool, 2025-05-07T20:32:43.9046003Z compiled: bool, 2025-05-07T20:32:43.9046232Z ) -> None: 2025-05-07T20:32:43.9046446Z torch.manual_seed(2025) 2025-05-07T20:32:43.9046694Z 2025-05-07T20:32:43.9046971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9047320Z 2025-05-07T20:32:43.9047512Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.9049574Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
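With only ~8 MiB free, even the elementwise preprocessing fails: torch.sign, torch.abs, torch.clamp, and the final multiply each materialize another tensor the size of x. A hypothetical in-place rewrite that produces the same values with a single temporary:

    import torch

    def clamp_preserving_sign_(x: torch.Tensor) -> torch.Tensor:
        # Equivalent to: torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0),
        # but mutates x and allocates only the sign tensor.
        sign = torch.sign(x)
        return x.abs_().clamp_(0.01, 2.0).mul_(sign)

This would not fix the leak across examples, but it lowers each example's peak footprint.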
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9051511Z 2025-05-07T20:32:43.9051631Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.9051843Z 2025-05-07T20:32:43.9051955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9052366Z self=, 2025-05-07T20:32:43.9052773Z T=128, 2025-05-07T20:32:43.9052968Z D=7168, 2025-05-07T20:32:43.9053157Z scale_ub=None, 2025-05-07T20:32:43.9053371Z contiguous=True, 2025-05-07T20:32:43.9053595Z compiled=True, 2025-05-07T20:32:43.9053800Z ) 2025-05-07T20:32:44.2296320Z self = 2025-05-07T20:32:44.2296838Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2297113Z 2025-05-07T20:32:44.2297202Z @given( 2025-05-07T20:32:44.2297433Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2297737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2298046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2298378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2298699Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2298987Z ) 2025-05-07T20:32:44.2299332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2299769Z def test_silu_mul_quant( 2025-05-07T20:32:44.2300015Z self, 2025-05-07T20:32:44.2300216Z T: int, 2025-05-07T20:32:44.2300408Z D: int, 2025-05-07T20:32:44.2300628Z scale_ub: Optional[float], 2025-05-07T20:32:44.2300901Z contiguous: bool, 2025-05-07T20:32:44.2301143Z compiled: bool, 2025-05-07T20:32:44.2301364Z ) -> None: 2025-05-07T20:32:44.2301580Z torch.manual_seed(2025) 2025-05-07T20:32:44.2301828Z 2025-05-07T20:32:44.2302099Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2304154Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2306016Z 2025-05-07T20:32:44.2306133Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.2306342Z 2025-05-07T20:32:44.2390608Z FAILED 2025-05-07T20:32:44.2390776Z 2025-05-07T20:32:44.2390914Z =================================== FAILURES =================================== 2025-05-07T20:32:44.2391375Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:44.2392014Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:44.2392853Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:44.2393616Z | yield 2025-05-07T20:32:44.2394205Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:44.2394931Z | self._callTestMethod(testMethod) 2025-05-07T20:32:44.2395792Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:44.2396558Z | if method() is not None: 2025-05-07T20:32:44.2396905Z | ^^^^^^^^ 2025-05-07T20:32:44.2398017Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:44.2399022Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2399461Z | ^^^^^^^ 2025-05-07T20:32:44.2400243Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:44.2401227Z | raise the_error_hypothesis_found 2025-05-07T20:32:44.2401803Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:44.2402387Z +-+---------------- 1 ---------------- 2025-05-07T20:32:44.2402785Z | Traceback (most recent call last): 2025-05-07T20:32:44.2403742Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.2404806Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2405337Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2408048Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
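The fourth failure in this group (below) trips the same fp8e4nv limitation inside the test's reference path, triton_quantize_fp8_row. PyTorch's float8 dtype casts are software conversions and do not go through Triton, so a rough stand-in for row-wise FP8 quantization can run even on this GPU; the FP8_MAX choice and the scale_ub handling here are assumptions, not fbgemm_gpu's exact semantics:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so the row maximum maps to the e4m3 max (448.0).
        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale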
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2410776Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2411379Z | self=, 2025-05-07T20:32:44.2411942Z | T=128, 2025-05-07T20:32:44.2412227Z | D=7168, 2025-05-07T20:32:44.2412518Z | scale_ub=1200.0, 2025-05-07T20:32:44.2412840Z | contiguous=True, 2025-05-07T20:32:44.2413179Z | compiled=False, 2025-05-07T20:32:44.2413407Z | ) 2025-05-07T20:32:44.2413580Z | 2025-05-07T20:32:44.2414103Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:44.2414714Z +---------------- 2 ---------------- 2025-05-07T20:32:44.2415006Z | Traceback (most recent call last): 2025-05-07T20:32:44.2415701Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.2416467Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2416840Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2418823Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2420783Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2421210Z | self=, 2025-05-07T20:32:44.2421614Z | T=128, 2025-05-07T20:32:44.2421814Z | D=7168, 2025-05-07T20:32:44.2422014Z | scale_ub=None, 2025-05-07T20:32:44.2422253Z | contiguous=True, 2025-05-07T20:32:44.2422491Z | compiled=True, 2025-05-07T20:32:44.2422707Z | ) 2025-05-07T20:32:44.2422884Z | 2025-05-07T20:32:44.2423503Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.2424106Z +---------------- 3 ---------------- 2025-05-07T20:32:44.2424390Z | Traceback (most recent call last): 2025-05-07T20:32:44.2425162Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.2425927Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2426293Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2428266Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
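Each falsifying example is accompanied by a @reproduce_failure hint like the ones printed above. Following that instruction, the decorator goes directly above @given on the original test (blob copied verbatim from failure 1; it only decodes against these exact strategies):

    from hypothesis import Verbosity, given, reproduce_failure, settings
    import hypothesis.strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')  # temporary; remove after debugging
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...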
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.2431011Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2431634Z | self=, 2025-05-07T20:32:44.2432067Z | T=128, 2025-05-07T20:32:44.2432315Z | D=5120, 2025-05-07T20:32:44.2432614Z | scale_ub=1200.0, 2025-05-07T20:32:44.2432957Z | contiguous=True, 2025-05-07T20:32:44.2433293Z | compiled=True, 2025-05-07T20:32:44.2433608Z | ) 2025-05-07T20:32:44.2433863Z | 2025-05-07T20:32:44.2434607Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.2435476Z +---------------- 4 ---------------- 2025-05-07T20:32:44.2435979Z | Traceback (most recent call last): 2025-05-07T20:32:44.2436972Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:44.2437931Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.2438330Z | ^^^^^^^^ 2025-05-07T20:32:44.2439231Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:44.2440177Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2440634Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2441721Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:44.2442808Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2443631Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:44.2444623Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2445247Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2446140Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:44.2447206Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2447834Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2448715Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:44.2449650Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2450287Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2451101Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:44.2451958Z | fn() 2025-05-07T20:32:44.2452757Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:44.2453598Z | self.fn.run( 2025-05-07T20:32:44.2454314Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:44.2454893Z | kernel = self.compile( 2025-05-07T20:32:44.2455149Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:44.2455733Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:44.2456433Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2456812Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2457556Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.2458582Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2459064Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.2459442Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2459788Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2460049Z | ^ 2025-05-07T20:32:44.2460509Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2461073Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.2461474Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:44.2461987Z | self=, 2025-05-07T20:32:44.2462418Z | T=1, # or any other generated value 2025-05-07T20:32:44.2462730Z | D=5120, # or any other generated value 2025-05-07T20:32:44.2463064Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.2463429Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.2463789Z | compiled=True, # or any other generated value 2025-05-07T20:32:44.2464089Z | ) 2025-05-07T20:32:44.2464275Z | 2025-05-07T20:32:44.2464797Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.2465757Z +------------------------------------ 2025-05-07T20:32:44.2466123Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:44.2466497Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2466904Z self=, 2025-05-07T20:32:44.2467302Z T=1, 2025-05-07T20:32:44.2467493Z D=5120, 2025-05-07T20:32:44.2467681Z scale_ub=None, 2025-05-07T20:32:44.2467893Z contiguous=True, 2025-05-07T20:32:44.2468111Z compiled=True, 2025-05-07T20:32:44.2468309Z ) 2025-05-07T20:32:44.2468624Z self = 2025-05-07T20:32:44.2469108Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2469414Z 2025-05-07T20:32:44.2469498Z @given( 2025-05-07T20:32:44.2469723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2470033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2470339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2470666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2471152Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2471550Z ) 2025-05-07T20:32:44.2472030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2472948Z def test_silu_mul_quant( 2025-05-07T20:32:44.2473297Z self, 2025-05-07T20:32:44.2473568Z T: int, 2025-05-07T20:32:44.2473860Z D: int, 2025-05-07T20:32:44.2474176Z scale_ub: Optional[float], 2025-05-07T20:32:44.2474540Z contiguous: bool, 2025-05-07T20:32:44.2474778Z compiled: bool, 2025-05-07T20:32:44.2475002Z ) -> None: 2025-05-07T20:32:44.2475219Z torch.manual_seed(2025) 2025-05-07T20:32:44.2475457Z 2025-05-07T20:32:44.2475874Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2476358Z 2025-05-07T20:32:44.2476621Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2477020Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2477465Z x = x_sign * x_clamp 2025-05-07T20:32:44.2477791Z x0 = x[:, :D] 2025-05-07T20:32:44.2478090Z x1 = x[:, D:] 2025-05-07T20:32:44.2478371Z 2025-05-07T20:32:44.2478621Z if contiguous: 2025-05-07T20:32:44.2478953Z x0 = x0.contiguous() 2025-05-07T20:32:44.2479307Z x1 = x1.contiguous() 2025-05-07T20:32:44.2479648Z 2025-05-07T20:32:44.2479921Z if scale_ub is not None: 2025-05-07T20:32:44.2480308Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2480771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2481206Z ) 2025-05-07T20:32:44.2481484Z else: 2025-05-07T20:32:44.2481781Z scale_ub_tensor = None 2025-05-07T20:32:44.2482135Z 2025-05-07T20:32:44.2482462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2482895Z op = silu_mul_quant 2025-05-07T20:32:44.2483256Z if compiled: 2025-05-07T20:32:44.2483606Z op = torch.compile(op) 2025-05-07T20:32:44.2484021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2484401Z 2025-05-07T20:32:44.2484676Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2485084Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2485469Z 2025-05-07T20:32:44.2485799Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2486266Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2486671Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2487110Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2487605Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2488037Z 2025-05-07T20:32:44.2488330Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.2488605Z 2025-05-07T20:32:44.2488748Z moe/activation_test.py:126: 2025-05-07T20:32:44.2489167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2489619Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2490077Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2491172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2492214Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2492965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2493907Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2494863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2495849Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2496994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2497866Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2498697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2499554Z fn() 2025-05-07T20:32:44.2500260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2501060Z self.fn.run( 2025-05-07T20:32:44.2501708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2502447Z kernel = self.compile( 2025-05-07T20:32:44.2503188Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2504078Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2504634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2504964Z 2025-05-07T20:32:44.2505245Z self = 2025-05-07T20:32:44.2506716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2508625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04acc360>} 2025-05-07T20:32:44.2510491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2511887Z context = 2025-05-07T20:32:44.2512271Z 2025-05-07T20:32:44.2512505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2513226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2513875Z module_map=module_map) 2025-05-07T20:32:44.2514378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2514878Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2515251Z E ^ 2025-05-07T20:32:44.2515991Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2516611Z 2025-05-07T20:32:44.2517174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2517868Z 2025-05-07T20:32:44.2518018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2518551Z self=, 2025-05-07T20:32:44.2519123Z T=2048, 2025-05-07T20:32:44.2519424Z D=5120, 2025-05-07T20:32:44.2519686Z scale_ub=1200.0, 2025-05-07T20:32:44.2519988Z contiguous=True, 2025-05-07T20:32:44.2520282Z compiled=False, 2025-05-07T20:32:44.2520585Z ) 2025-05-07T20:32:44.2521019Z self = 2025-05-07T20:32:44.2521695Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.2522067Z 2025-05-07T20:32:44.2522184Z @given( 2025-05-07T20:32:44.2522500Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2522937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2523375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2523833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2524295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2524695Z ) 2025-05-07T20:32:44.2525295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2525909Z def test_silu_mul_quant( 2025-05-07T20:32:44.2526254Z self, 2025-05-07T20:32:44.2526550Z T: int, 2025-05-07T20:32:44.2526934Z D: int, 2025-05-07T20:32:44.2527251Z scale_ub: Optional[float], 2025-05-07T20:32:44.2527623Z contiguous: bool, 2025-05-07T20:32:44.2527965Z compiled: bool, 2025-05-07T20:32:44.2528290Z ) -> None: 2025-05-07T20:32:44.2528608Z torch.manual_seed(2025) 2025-05-07T20:32:44.2528952Z 2025-05-07T20:32:44.2529385Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2529863Z 2025-05-07T20:32:44.2530135Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2530548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2530987Z x = x_sign * x_clamp 2025-05-07T20:32:44.2552266Z x0 = x[:, :D] 
2025-05-07T20:32:44.2552592Z x1 = x[:, D:] 2025-05-07T20:32:44.2552897Z 2025-05-07T20:32:44.2553189Z if contiguous: 2025-05-07T20:32:44.2553477Z x0 = x0.contiguous() 2025-05-07T20:32:44.2553804Z x1 = x1.contiguous() 2025-05-07T20:32:44.2554092Z 2025-05-07T20:32:44.2554335Z if scale_ub is not None: 2025-05-07T20:32:44.2554667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2555075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2555452Z ) 2025-05-07T20:32:44.2555838Z else: 2025-05-07T20:32:44.2556147Z scale_ub_tensor = None 2025-05-07T20:32:44.2556515Z 2025-05-07T20:32:44.2556856Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2557291Z op = silu_mul_quant 2025-05-07T20:32:44.2557636Z if compiled: 2025-05-07T20:32:44.2557986Z op = torch.compile(op) 2025-05-07T20:32:44.2558392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2558727Z 2025-05-07T20:32:44.2559000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2559223Z 2025-05-07T20:32:44.2559365Z moe/activation_test.py:117: 2025-05-07T20:32:44.2559779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2560212Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2560619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2561578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2562549Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2563293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2564206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2565764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2566471Z kernel = self.compile( 2025-05-07T20:32:44.2567161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2567978Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2568529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2568850Z 2025-05-07T20:32:44.2569149Z self = 2025-05-07T20:32:44.2570617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2572482Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c054f1e40>} 2025-05-07T20:32:44.2574606Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2576124Z context = 2025-05-07T20:32:44.2578928Z 2025-05-07T20:32:44.2579168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2579906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2580544Z module_map=module_map) 2025-05-07T20:32:44.2581043Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2581528Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2581884Z E ^ 2025-05-07T20:32:44.2582523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2583149Z 2025-05-07T20:32:44.2583725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2584414Z 2025-05-07T20:32:44.2584566Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2585114Z self=, 2025-05-07T20:32:44.2585639Z T=2048, 2025-05-07T20:32:44.2585900Z D=5120, 2025-05-07T20:32:44.2586163Z scale_ub=1200.0, 2025-05-07T20:32:44.2586473Z contiguous=True, 2025-05-07T20:32:44.2586779Z compiled=True, 2025-05-07T20:32:44.2587039Z ) 2025-05-07T20:32:44.2587468Z self = 2025-05-07T20:32:44.2588118Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.2588468Z 2025-05-07T20:32:44.2588582Z @given( 2025-05-07T20:32:44.2588889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2589326Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2589749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2590206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2590663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2591072Z ) 2025-05-07T20:32:44.2591547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2592156Z def test_silu_mul_quant( 2025-05-07T20:32:44.2592496Z self, 2025-05-07T20:32:44.2592775Z T: int, 2025-05-07T20:32:44.2593050Z D: int, 2025-05-07T20:32:44.2593359Z scale_ub: Optional[float], 2025-05-07T20:32:44.2593738Z contiguous: bool, 2025-05-07T20:32:44.2594068Z compiled: bool, 2025-05-07T20:32:44.2594384Z ) -> None: 2025-05-07T20:32:44.2594691Z torch.manual_seed(2025) 2025-05-07T20:32:44.2595027Z 2025-05-07T20:32:44.2595413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2596020Z 2025-05-07T20:32:44.2596298Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2596713Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2597142Z x = x_sign * x_clamp 2025-05-07T20:32:44.2597480Z x0 = x[:, :D] 2025-05-07T20:32:44.2597791Z x1 = x[:, D:] 2025-05-07T20:32:44.2598091Z 2025-05-07T20:32:44.2598351Z if contiguous: 2025-05-07T20:32:44.2598689Z x0 = x0.contiguous() 2025-05-07T20:32:44.2599061Z x1 = x1.contiguous() 2025-05-07T20:32:44.2599445Z 2025-05-07T20:32:44.2599713Z if scale_ub is not None: 2025-05-07T20:32:44.2600100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2600570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2600998Z ) 2025-05-07T20:32:44.2601280Z else: 2025-05-07T20:32:44.2601583Z scale_ub_tensor = None 2025-05-07T20:32:44.2601934Z 2025-05-07T20:32:44.2602431Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2602885Z op = silu_mul_quant 2025-05-07T20:32:44.2603231Z if compiled: 2025-05-07T20:32:44.2603584Z op = torch.compile(op) 2025-05-07T20:32:44.2604096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2604464Z 2025-05-07T20:32:44.2604737Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2605126Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2605531Z 2025-05-07T20:32:44.2605865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2606327Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2606735Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2607171Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2607655Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2608085Z 2025-05-07T20:32:44.2608359Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.2608641Z 2025-05-07T20:32:44.2608778Z moe/activation_test.py:126: 2025-05-07T20:32:44.2609186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2609639Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2610081Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2611127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2612144Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2612887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2613829Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2614759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2615746Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2616701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2617561Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2618372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2619052Z fn() 2025-05-07T20:32:44.2619722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2620492Z self.fn.run( 2025-05-07T20:32:44.2621114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2621778Z kernel = self.compile( 2025-05-07T20:32:44.2622458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2623336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2623865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2624166Z 2025-05-07T20:32:44.2624441Z self = 2025-05-07T20:32:44.2625799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2627609Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c0535ac00>} 2025-05-07T20:32:44.2629462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2630872Z context = 2025-05-07T20:32:44.2631252Z 2025-05-07T20:32:44.2631477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2632226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2632805Z module_map=module_map) 2025-05-07T20:32:44.2633250Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2633690Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2634016Z E ^ 2025-05-07T20:32:44.2634586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2635146Z 2025-05-07T20:32:44.2635666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2636500Z 2025-05-07T20:32:44.2636655Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2637225Z self=, 2025-05-07T20:32:44.2637803Z T=16384, 2025-05-07T20:32:44.2638082Z D=7168, 2025-05-07T20:32:44.2638356Z scale_ub=1200.0, 2025-05-07T20:32:44.2638663Z contiguous=False, 2025-05-07T20:32:44.2638980Z compiled=False, 2025-05-07T20:32:44.2639253Z ) 2025-05-07T20:32:44.2639697Z self = 2025-05-07T20:32:44.2640369Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.2640749Z 2025-05-07T20:32:44.2640857Z @given( 2025-05-07T20:32:44.2641174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2641605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2642023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2642472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2642920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2643317Z ) 2025-05-07T20:32:44.2643800Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2644424Z def test_silu_mul_quant( 2025-05-07T20:32:44.2644768Z self, 2025-05-07T20:32:44.2645041Z T: int, 2025-05-07T20:32:44.2645325Z D: int, 2025-05-07T20:32:44.2645635Z scale_ub: Optional[float], 2025-05-07T20:32:44.2646011Z contiguous: bool, 2025-05-07T20:32:44.2646349Z compiled: bool, 2025-05-07T20:32:44.2646669Z ) -> None: 2025-05-07T20:32:44.2646965Z torch.manual_seed(2025) 2025-05-07T20:32:44.2647309Z 2025-05-07T20:32:44.2647689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2648170Z 2025-05-07T20:32:44.2648419Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2648804Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2649230Z x = x_sign * x_clamp 2025-05-07T20:32:44.2649555Z x0 = x[:, :D] 2025-05-07T20:32:44.2649853Z x1 = x[:, D:] 2025-05-07T20:32:44.2650139Z 2025-05-07T20:32:44.2650390Z if contiguous: 2025-05-07T20:32:44.2650710Z x0 = x0.contiguous() 2025-05-07T20:32:44.2651064Z x1 = x1.contiguous() 2025-05-07T20:32:44.2651385Z 2025-05-07T20:32:44.2651653Z if scale_ub is not None: 2025-05-07T20:32:44.2652027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2652466Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2652895Z ) 2025-05-07T20:32:44.2653171Z else: 2025-05-07T20:32:44.2653465Z scale_ub_tensor = None 2025-05-07T20:32:44.2653827Z 2025-05-07T20:32:44.2654147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2654578Z op = silu_mul_quant 2025-05-07T20:32:44.2654918Z if compiled: 2025-05-07T20:32:44.2655383Z op = torch.compile(op) 2025-05-07T20:32:44.2655779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2656137Z 2025-05-07T20:32:44.2656403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2656757Z 2025-05-07T20:32:44.2656900Z moe/activation_test.py:117: 2025-05-07T20:32:44.2657297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2657749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2658131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2659040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.2659962Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2660683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2661599Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2662480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2663203Z kernel = self.compile( 2025-05-07T20:32:44.2663944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2664827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2665643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2665973Z 2025-05-07T20:32:44.2666247Z self = 2025-05-07T20:32:44.2667704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2669565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05583ce0>} 2025-05-07T20:32:44.2671406Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2672832Z context = 2025-05-07T20:32:44.2673229Z 2025-05-07T20:32:44.2673457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2674180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2674801Z module_map=module_map) 2025-05-07T20:32:44.2675291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2675863Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2676221Z E ^ 2025-05-07T20:32:44.2676861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2677480Z 2025-05-07T20:32:44.2678033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2678720Z 2025-05-07T20:32:44.2678871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2679420Z self=, 2025-05-07T20:32:44.2679974Z T=1, 2025-05-07T20:32:44.2680234Z D=7168, 2025-05-07T20:32:44.2680502Z scale_ub=None, 2025-05-07T20:32:44.2680805Z contiguous=True, 2025-05-07T20:32:44.2681118Z compiled=True, 2025-05-07T20:32:44.2681404Z ) 2025-05-07T20:32:44.2681854Z self = 2025-05-07T20:32:44.2682520Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2682871Z 2025-05-07T20:32:44.2683245Z @given( 2025-05-07T20:32:44.2683571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2684014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2684599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2685041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2685480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2685866Z ) 2025-05-07T20:32:44.2686325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2686922Z def test_silu_mul_quant( 2025-05-07T20:32:44.2687277Z self, 2025-05-07T20:32:44.2687570Z T: int, 2025-05-07T20:32:44.2687850Z D: int, 2025-05-07T20:32:44.2688171Z scale_ub: Optional[float], 2025-05-07T20:32:44.2688513Z contiguous: bool, 2025-05-07T20:32:44.2688828Z compiled: bool, 2025-05-07T20:32:44.2689140Z ) -> None: 2025-05-07T20:32:44.2689455Z torch.manual_seed(2025) 2025-05-07T20:32:44.2689804Z 2025-05-07T20:32:44.2690186Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2690665Z 2025-05-07T20:32:44.2690943Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2691347Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2691784Z x = x_sign * x_clamp 2025-05-07T20:32:44.2692118Z x0 = x[:, :D] 2025-05-07T20:32:44.2692427Z x1 = x[:, D:] 2025-05-07T20:32:44.2692733Z 2025-05-07T20:32:44.2692993Z if contiguous: 2025-05-07T20:32:44.2693314Z x0 = x0.contiguous() 2025-05-07T20:32:44.2693677Z x1 = x1.contiguous() 2025-05-07T20:32:44.2694022Z 2025-05-07T20:32:44.2694293Z if scale_ub is not None: 2025-05-07T20:32:44.2694668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2695125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2695551Z ) 2025-05-07T20:32:44.2695834Z else: 2025-05-07T20:32:44.2696140Z scale_ub_tensor = None 2025-05-07T20:32:44.2696497Z 2025-05-07T20:32:44.2696825Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2697275Z op = silu_mul_quant 2025-05-07T20:32:44.2697623Z if compiled: 2025-05-07T20:32:44.2697976Z op = torch.compile(op) 2025-05-07T20:32:44.2698394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2698779Z 2025-05-07T20:32:44.2699061Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2699461Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2699854Z 2025-05-07T20:32:44.2700184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2700644Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2701046Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2701468Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2701961Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2702394Z 2025-05-07T20:32:44.2702680Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2702963Z 2025-05-07T20:32:44.2703109Z moe/activation_test.py:126: 2025-05-07T20:32:44.2703532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2704018Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2704488Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2705580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2706599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2707323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2708371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2709320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2710311Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2711430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2712317Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2713152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2713862Z fn() 2025-05-07T20:32:44.2714569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2715386Z self.fn.run( 2025-05-07T20:32:44.2716175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2716937Z kernel = self.compile( 2025-05-07T20:32:44.2717687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2718425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2718823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2719076Z 2025-05-07T20:32:44.2719318Z self = 2025-05-07T20:32:44.2720408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2721788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c04fac720>} 2025-05-07T20:32:44.2723134Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2724158Z context = 2025-05-07T20:32:44.2724454Z 2025-05-07T20:32:44.2724622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2725145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2725609Z module_map=module_map) 2025-05-07T20:32:44.2725975Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2726336Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2726605Z E ^ 2025-05-07T20:32:44.2727064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2727525Z 2025-05-07T20:32:44.2727936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2728443Z 2025-05-07T20:32:44.2728559Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2728974Z self=, 2025-05-07T20:32:44.2729375Z T=4096, 2025-05-07T20:32:44.2729573Z D=5120, 2025-05-07T20:32:44.2729772Z scale_ub=None, 2025-05-07T20:32:44.2729986Z contiguous=False, 2025-05-07T20:32:44.2730215Z compiled=False, 2025-05-07T20:32:44.2730425Z ) 2025-05-07T20:32:44.2730743Z self = 2025-05-07T20:32:44.2731238Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2731514Z 2025-05-07T20:32:44.2731604Z @given( 2025-05-07T20:32:44.2731835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2732266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2732584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2732922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2733328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2733619Z ) 2025-05-07T20:32:44.2733968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2734406Z def test_silu_mul_quant( 2025-05-07T20:32:44.2734656Z self, 2025-05-07T20:32:44.2734863Z T: int, 2025-05-07T20:32:44.2735059Z D: int, 2025-05-07T20:32:44.2735283Z scale_ub: Optional[float], 2025-05-07T20:32:44.2735559Z contiguous: bool, 2025-05-07T20:32:44.2735800Z compiled: bool, 2025-05-07T20:32:44.2736031Z ) -> None: 2025-05-07T20:32:44.2736251Z torch.manual_seed(2025) 2025-05-07T20:32:44.2736494Z 2025-05-07T20:32:44.2736778Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2737133Z 2025-05-07T20:32:44.2737332Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2737634Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2737968Z x = x_sign * x_clamp 2025-05-07T20:32:44.2738217Z x0 = x[:, :D] 2025-05-07T20:32:44.2738437Z x1 = x[:, D:] 2025-05-07T20:32:44.2738655Z 2025-05-07T20:32:44.2738851Z if contiguous: 2025-05-07T20:32:44.2739088Z x0 = x0.contiguous() 2025-05-07T20:32:44.2739356Z x1 = x1.contiguous() 2025-05-07T20:32:44.2739603Z 2025-05-07T20:32:44.2739799Z if scale_ub is not None: 2025-05-07T20:32:44.2740079Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2740422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2740735Z ) 2025-05-07T20:32:44.2740936Z else: 2025-05-07T20:32:44.2741160Z scale_ub_tensor = None 2025-05-07T20:32:44.2741413Z 2025-05-07T20:32:44.2741657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2741981Z op = silu_mul_quant 2025-05-07T20:32:44.2742234Z if compiled: 2025-05-07T20:32:44.2742492Z op = torch.compile(op) 2025-05-07T20:32:44.2742801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2743083Z 2025-05-07T20:32:44.2743281Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2743454Z 2025-05-07T20:32:44.2743558Z moe/activation_test.py:117: 2025-05-07T20:32:44.2743861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2752550Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2752882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2753582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2754272Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2754828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2755517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2756361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2756893Z kernel = self.compile( 2025-05-07T20:32:44.2757441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2757617Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2757753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2757759Z 2025-05-07T20:32:44.2757965Z self = 2025-05-07T20:32:44.2758912Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2759472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c49a0>} 2025-05-07T20:32:44.2760291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2760492Z context = 2025-05-07T20:32:44.2760497Z 2025-05-07T20:32:44.2760664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2760936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2761050Z module_map=module_map) 2025-05-07T20:32:44.2761221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2761327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2761404Z E ^ 2025-05-07T20:32:44.2761769Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2761779Z 2025-05-07T20:32:44.2762192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2762197Z 2025-05-07T20:32:44.2762308Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2762534Z self=, 2025-05-07T20:32:44.2762615Z T=4096, 2025-05-07T20:32:44.2762703Z D=7168, 2025-05-07T20:32:44.2762789Z scale_ub=None, 2025-05-07T20:32:44.2762879Z contiguous=False, 2025-05-07T20:32:44.2762979Z compiled=False, 2025-05-07T20:32:44.2763058Z ) 2025-05-07T20:32:44.2763282Z self = 2025-05-07T20:32:44.2763472Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2763477Z 2025-05-07T20:32:44.2763566Z @given( 2025-05-07T20:32:44.2763696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2763799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2763917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2764048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2764168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2764249Z ) 2025-05-07T20:32:44.2764501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2764597Z def test_silu_mul_quant( 2025-05-07T20:32:44.2764688Z self, 2025-05-07T20:32:44.2764769Z T: int, 2025-05-07T20:32:44.2764850Z D: int, 2025-05-07T20:32:44.2764959Z scale_ub: Optional[float], 2025-05-07T20:32:44.2765057Z contiguous: bool, 2025-05-07T20:32:44.2765146Z compiled: bool, 2025-05-07T20:32:44.2765236Z ) -> None: 2025-05-07T20:32:44.2765334Z torch.manual_seed(2025) 2025-05-07T20:32:44.2765758Z 2025-05-07T20:32:44.2765981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2766059Z 2025-05-07T20:32:44.2766155Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2766305Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2766433Z x = x_sign * x_clamp 2025-05-07T20:32:44.2766542Z x0 = x[:, :D] 2025-05-07T20:32:44.2766635Z x1 = x[:, D:] 2025-05-07T20:32:44.2766712Z 2025-05-07T20:32:44.2766806Z if contiguous: 2025-05-07T20:32:44.2766902Z x0 = x0.contiguous() 2025-05-07T20:32:44.2766996Z x1 = x1.contiguous() 2025-05-07T20:32:44.2767078Z 2025-05-07T20:32:44.2767172Z if scale_ub is not None: 2025-05-07T20:32:44.2767500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2767648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2767726Z ) 2025-05-07T20:32:44.2767924Z else: 2025-05-07T20:32:44.2768030Z scale_ub_tensor = None 2025-05-07T20:32:44.2768105Z 2025-05-07T20:32:44.2768236Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2768336Z op = silu_mul_quant 2025-05-07T20:32:44.2768423Z if compiled: 2025-05-07T20:32:44.2768533Z op = torch.compile(op) 2025-05-07T20:32:44.2768639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2768716Z 2025-05-07T20:32:44.2768818Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2768823Z 2025-05-07T20:32:44.2768922Z moe/activation_test.py:117: 2025-05-07T20:32:44.2769055Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2769182Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2769301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2769825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2769942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2770300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2770532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2770872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2770969Z kernel = self.compile( 2025-05-07T20:32:44.2771359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2771537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2771678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2771683Z 2025-05-07T20:32:44.2771888Z self = 2025-05-07T20:32:44.2772672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2773181Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c4e00>} 2025-05-07T20:32:44.2773924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2774124Z context = 2025-05-07T20:32:44.2774135Z 2025-05-07T20:32:44.2774300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2774564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2774683Z module_map=module_map) 2025-05-07T20:32:44.2774847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2774955Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2775036Z E ^ 2025-05-07T20:32:44.2775391Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2775396Z 2025-05-07T20:32:44.2775813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2775818Z 2025-05-07T20:32:44.2775923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2776236Z self=, 2025-05-07T20:32:44.2776318Z T=128, 2025-05-07T20:32:44.2776397Z D=7168, 2025-05-07T20:32:44.2776488Z scale_ub=None, 2025-05-07T20:32:44.2776580Z contiguous=False, 2025-05-07T20:32:44.2776742Z compiled=True, 2025-05-07T20:32:44.2776824Z ) 2025-05-07T20:32:44.2777045Z self = 2025-05-07T20:32:44.2777217Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.2777222Z 2025-05-07T20:32:44.2777309Z @given( 2025-05-07T20:32:44.2777431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2777537Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2777666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2777785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2777906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2777984Z ) 2025-05-07T20:32:44.2778233Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2778338Z def test_silu_mul_quant( 2025-05-07T20:32:44.2778417Z self, 2025-05-07T20:32:44.2778505Z T: int, 2025-05-07T20:32:44.2778594Z D: int, 2025-05-07T20:32:44.2778698Z scale_ub: Optional[float], 2025-05-07T20:32:44.2778791Z contiguous: bool, 2025-05-07T20:32:44.2778890Z compiled: bool, 2025-05-07T20:32:44.2778972Z ) -> None: 2025-05-07T20:32:44.2779081Z torch.manual_seed(2025) 2025-05-07T20:32:44.2779175Z 2025-05-07T20:32:44.2779371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2779455Z 2025-05-07T20:32:44.2779549Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2779676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2779773Z x = x_sign * x_clamp 2025-05-07T20:32:44.2779856Z x0 = x[:, :D] 2025-05-07T20:32:44.2779945Z x1 = x[:, D:] 2025-05-07T20:32:44.2780027Z 2025-05-07T20:32:44.2780113Z if contiguous: 2025-05-07T20:32:44.2780207Z x0 = x0.contiguous() 2025-05-07T20:32:44.2780307Z x1 = x1.contiguous() 2025-05-07T20:32:44.2780387Z 2025-05-07T20:32:44.2780482Z if scale_ub is not None: 2025-05-07T20:32:44.2780598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2780734Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2780821Z ) 2025-05-07T20:32:44.2780900Z else: 2025-05-07T20:32:44.2780997Z scale_ub_tensor = None 2025-05-07T20:32:44.2781083Z 2025-05-07T20:32:44.2781215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2781309Z op = silu_mul_quant 2025-05-07T20:32:44.2781404Z if compiled: 2025-05-07T20:32:44.2781505Z op = torch.compile(op) 2025-05-07T20:32:44.2781614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2781701Z 2025-05-07T20:32:44.2781796Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2781918Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2782001Z 2025-05-07T20:32:44.2782145Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2782259Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2782361Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2782486Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2782633Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2782710Z 2025-05-07T20:32:44.2782815Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2782819Z 2025-05-07T20:32:44.2782927Z moe/activation_test.py:126: 2025-05-07T20:32:44.2783059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2783175Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2783400Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2783963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2784867Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2785229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2785451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2785828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2786085Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2786467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2786640Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2786982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2787072Z fn() 2025-05-07T20:32:44.2787476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2787562Z self.fn.run( 2025-05-07T20:32:44.2787907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2788003Z kernel = self.compile( 2025-05-07T20:32:44.2788395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2788569Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2788700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2788704Z 2025-05-07T20:32:44.2788925Z self = 2025-05-07T20:32:44.2789755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2790271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea6c5a80>} 2025-05-07T20:32:44.2791013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2791206Z context = 2025-05-07T20:32:44.2791221Z 2025-05-07T20:32:44.2791387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2791656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2791776Z module_map=module_map) 2025-05-07T20:32:44.2791938Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2792046Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2792132Z E ^ 2025-05-07T20:32:44.2792490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2792494Z 2025-05-07T20:32:44.2792911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2792915Z 2025-05-07T20:32:44.2793021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2793247Z self=, 2025-05-07T20:32:44.2793333Z T=128, 2025-05-07T20:32:44.2793414Z D=7168, 2025-05-07T20:32:44.2793625Z scale_ub=None, 2025-05-07T20:32:44.2793727Z contiguous=False, 2025-05-07T20:32:44.2793815Z compiled=False, 2025-05-07T20:32:44.2793894Z ) 2025-05-07T20:32:44.2794122Z self = 2025-05-07T20:32:44.2794373Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.2794378Z 2025-05-07T20:32:44.2794469Z @given( 2025-05-07T20:32:44.2794591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2794692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2794818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2794937Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2795053Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2795138Z ) 2025-05-07T20:32:44.2795387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2795490Z def test_silu_mul_quant( 2025-05-07T20:32:44.2795578Z self, 2025-05-07T20:32:44.2795659Z T: int, 2025-05-07T20:32:44.2795810Z D: int, 2025-05-07T20:32:44.2795913Z scale_ub: Optional[float], 2025-05-07T20:32:44.2796010Z contiguous: bool, 2025-05-07T20:32:44.2796107Z compiled: bool, 2025-05-07T20:32:44.2796188Z ) -> None: 2025-05-07T20:32:44.2796287Z torch.manual_seed(2025) 2025-05-07T20:32:44.2796373Z 2025-05-07T20:32:44.2796544Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2796623Z 2025-05-07T20:32:44.2796725Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2796851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2796944Z x = x_sign * x_clamp 2025-05-07T20:32:44.2797035Z x0 = x[:, :D] 2025-05-07T20:32:44.2797117Z x1 = x[:, D:] 2025-05-07T20:32:44.2797202Z 2025-05-07T20:32:44.2797288Z if contiguous: 2025-05-07T20:32:44.2797383Z x0 = x0.contiguous() 2025-05-07T20:32:44.2797485Z x1 = x1.contiguous() 2025-05-07T20:32:44.2797562Z 2025-05-07T20:32:44.2797656Z if scale_ub is not None: 2025-05-07T20:32:44.2797772Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2797913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2797993Z ) 2025-05-07T20:32:44.2798080Z else: 2025-05-07T20:32:44.2798178Z scale_ub_tensor = None 2025-05-07T20:32:44.2798254Z 2025-05-07T20:32:44.2798392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2798483Z op = silu_mul_quant 2025-05-07T20:32:44.2798580Z if compiled: 2025-05-07T20:32:44.2798683Z op = torch.compile(op) 2025-05-07T20:32:44.2798794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2798876Z 2025-05-07T20:32:44.2798978Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2798982Z 2025-05-07T20:32:44.2799085Z moe/activation_test.py:117: 2025-05-07T20:32:44.2799217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2799326Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2799434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2799933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2800038Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2800396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2800624Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2800961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2801060Z kernel = self.compile( 2025-05-07T20:32:44.2801534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2801712Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2801847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2801928Z 2025-05-07T20:32:44.2802134Z self = 2025-05-07T20:32:44.2802912Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2803421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea19c540>} 2025-05-07T20:32:44.2804171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2804369Z context = 2025-05-07T20:32:44.2804381Z 2025-05-07T20:32:44.2804547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2804811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2804925Z module_map=module_map) 2025-05-07T20:32:44.2805088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2805195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2805274Z E ^ 2025-05-07T20:32:44.2805629Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2805634Z 2025-05-07T20:32:44.2806055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2806060Z 2025-05-07T20:32:44.2806166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2806396Z self=, 2025-05-07T20:32:44.2806481Z T=4096, 2025-05-07T20:32:44.2806560Z D=5120, 2025-05-07T20:32:44.2806650Z scale_ub=1200.0, 2025-05-07T20:32:44.2806735Z contiguous=True, 2025-05-07T20:32:44.2806820Z compiled=False, 2025-05-07T20:32:44.2806901Z ) 2025-05-07T20:32:44.2807119Z self = 2025-05-07T20:32:44.2807297Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.2807301Z 2025-05-07T20:32:44.2807385Z @given( 2025-05-07T20:32:44.2807506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2807605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2807727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2807850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2807969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2808045Z ) 2025-05-07T20:32:44.2808292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2808398Z def test_silu_mul_quant( 2025-05-07T20:32:44.2808476Z self, 2025-05-07T20:32:44.2808554Z T: int, 2025-05-07T20:32:44.2808639Z D: int, 2025-05-07T20:32:44.2808739Z scale_ub: Optional[float], 2025-05-07T20:32:44.2808830Z contiguous: bool, 2025-05-07T20:32:44.2808923Z compiled: bool, 2025-05-07T20:32:44.2809002Z ) -> None: 2025-05-07T20:32:44.2809111Z torch.manual_seed(2025) 2025-05-07T20:32:44.2809203Z 2025-05-07T20:32:44.2809396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2809479Z 2025-05-07T20:32:44.2809571Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2809779Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2809877Z x = x_sign * x_clamp 2025-05-07T20:32:44.2809960Z x0 = x[:, :D] 2025-05-07T20:32:44.2810042Z x1 = x[:, D:] 2025-05-07T20:32:44.2810236Z 2025-05-07T20:32:44.2810325Z if contiguous: 2025-05-07T20:32:44.2810417Z x0 = x0.contiguous() 2025-05-07T20:32:44.2810513Z x1 = x1.contiguous() 2025-05-07T20:32:44.2810587Z 2025-05-07T20:32:44.2810685Z if scale_ub is not None: 2025-05-07T20:32:44.2810798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2810933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2811017Z ) 2025-05-07T20:32:44.2811098Z else: 2025-05-07T20:32:44.2811198Z scale_ub_tensor = None 2025-05-07T20:32:44.2811278Z 2025-05-07T20:32:44.2811409Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2811502Z op = silu_mul_quant 2025-05-07T20:32:44.2811598Z if compiled: 2025-05-07T20:32:44.2811698Z op = torch.compile(op) 2025-05-07T20:32:44.2811805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2811890Z 2025-05-07T20:32:44.2811991Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2811996Z 2025-05-07T20:32:44.2812101Z moe/activation_test.py:117: 2025-05-07T20:32:44.2812232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2812334Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2812442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2812940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2813038Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2813403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2813630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2813974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2814078Z kernel = self.compile( 2025-05-07T20:32:44.2814458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2814641Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2814772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2814776Z 2025-05-07T20:32:44.2814986Z self = 2025-05-07T20:32:44.2815769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2816277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea19e480>} 2025-05-07T20:32:44.2817032Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2817225Z context = 2025-05-07T20:32:44.2817230Z 2025-05-07T20:32:44.2817402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2817667Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2817775Z module_map=module_map) 2025-05-07T20:32:44.2817950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2818054Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2818215Z E ^ 2025-05-07T20:32:44.2818578Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2818681Z 2025-05-07T20:32:44.2819093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2819098Z 2025-05-07T20:32:44.2819213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2819439Z self=, 2025-05-07T20:32:44.2819518Z T=1, 2025-05-07T20:32:44.2819603Z D=5120, 2025-05-07T20:32:44.2819687Z scale_ub=None, 2025-05-07T20:32:44.2819775Z contiguous=True, 2025-05-07T20:32:44.2819866Z compiled=True, 2025-05-07T20:32:44.2819943Z ) 2025-05-07T20:32:44.2820168Z self = 2025-05-07T20:32:44.2820341Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2820346Z 2025-05-07T20:32:44.2820425Z @given( 2025-05-07T20:32:44.2820550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2820651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2820772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2820897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2821011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2821087Z ) 2025-05-07T20:32:44.2821337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2821432Z def test_silu_mul_quant( 2025-05-07T20:32:44.2821518Z self, 2025-05-07T20:32:44.2821596Z T: int, 2025-05-07T20:32:44.2821675Z D: int, 2025-05-07T20:32:44.2821780Z scale_ub: Optional[float], 2025-05-07T20:32:44.2821876Z contiguous: bool, 2025-05-07T20:32:44.2821967Z compiled: bool, 2025-05-07T20:32:44.2822053Z ) -> None: 2025-05-07T20:32:44.2822154Z torch.manual_seed(2025) 2025-05-07T20:32:44.2822229Z 2025-05-07T20:32:44.2822408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2822490Z 2025-05-07T20:32:44.2822584Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2822715Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2822806Z x = x_sign * x_clamp 2025-05-07T20:32:44.2822894Z x0 = x[:, :D] 2025-05-07T20:32:44.2822975Z x1 = x[:, D:] 2025-05-07T20:32:44.2823049Z 2025-05-07T20:32:44.2823139Z if contiguous: 2025-05-07T20:32:44.2823233Z x0 = x0.contiguous() 2025-05-07T20:32:44.2823323Z x1 = x1.contiguous() 2025-05-07T20:32:44.2823404Z 2025-05-07T20:32:44.2823496Z if scale_ub is not None: 2025-05-07T20:32:44.2823607Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2823746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2823829Z ) 2025-05-07T20:32:44.2823908Z else: 2025-05-07T20:32:44.2824008Z scale_ub_tensor = None 2025-05-07T20:32:44.2824086Z 2025-05-07T20:32:44.2824215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2824318Z op = silu_mul_quant 2025-05-07T20:32:44.2824402Z if compiled: 2025-05-07T20:32:44.2824509Z op = torch.compile(op) 2025-05-07T20:32:44.2824616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2824690Z 2025-05-07T20:32:44.2824792Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2824915Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2824989Z 2025-05-07T20:32:44.2825133Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2825236Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2825342Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2825552Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2825693Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2825775Z 2025-05-07T20:32:44.2825877Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2825955Z 2025-05-07T20:32:44.2826056Z moe/activation_test.py:126: 2025-05-07T20:32:44.2826192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2826300Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2826436Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2827000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2827102Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2827466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2827692Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2828061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2828331Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2828703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2828870Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2829216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2829295Z fn() 2025-05-07T20:32:44.2829698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2829783Z self.fn.run( 2025-05-07T20:32:44.2830123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2830226Z kernel = self.compile( 2025-05-07T20:32:44.2830605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2830792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2830924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2830928Z 2025-05-07T20:32:44.2831135Z self = 2025-05-07T20:32:44.2831921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2832429Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05058c20>} 2025-05-07T20:32:44.2833179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2833376Z context = 2025-05-07T20:32:44.2833381Z 2025-05-07T20:32:44.2833547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2833817Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2833927Z module_map=module_map) 2025-05-07T20:32:44.2834097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2834201Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2834281Z E ^ 2025-05-07T20:32:44.2834726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2834731Z 2025-05-07T20:32:44.2835146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2835227Z 2025-05-07T20:32:44.2835339Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2835564Z self=, 2025-05-07T20:32:44.2835643Z T=2048, 2025-05-07T20:32:44.2835782Z D=5120, 2025-05-07T20:32:44.2835867Z scale_ub=None, 2025-05-07T20:32:44.2835956Z contiguous=True, 2025-05-07T20:32:44.2836048Z compiled=True, 2025-05-07T20:32:44.2836123Z ) 2025-05-07T20:32:44.2836344Z self = 2025-05-07T20:32:44.2836522Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2836526Z 2025-05-07T20:32:44.2836605Z @given( 2025-05-07T20:32:44.2836732Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2836838Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2836955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2837078Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2837198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2837273Z ) 2025-05-07T20:32:44.2837522Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2837615Z def test_silu_mul_quant( 2025-05-07T20:32:44.2837693Z self, 2025-05-07T20:32:44.2837777Z T: int, 2025-05-07T20:32:44.2837858Z D: int, 2025-05-07T20:32:44.2837959Z scale_ub: Optional[float], 2025-05-07T20:32:44.2838055Z contiguous: bool, 2025-05-07T20:32:44.2838142Z compiled: bool, 2025-05-07T20:32:44.2838227Z ) -> None: 2025-05-07T20:32:44.2838323Z torch.manual_seed(2025) 2025-05-07T20:32:44.2838397Z 2025-05-07T20:32:44.2838574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2838653Z 2025-05-07T20:32:44.2838746Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2838877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2838975Z x = x_sign * x_clamp 2025-05-07T20:32:44.2839060Z x0 = x[:, :D] 2025-05-07T20:32:44.2839148Z x1 = x[:, D:] 2025-05-07T20:32:44.2839222Z 2025-05-07T20:32:44.2839307Z if contiguous: 2025-05-07T20:32:44.2839409Z x0 = x0.contiguous() 2025-05-07T20:32:44.2839500Z x1 = x1.contiguous() 2025-05-07T20:32:44.2839574Z 2025-05-07T20:32:44.2839675Z if scale_ub is not None: 2025-05-07T20:32:44.2839781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2839925Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2840003Z ) 2025-05-07T20:32:44.2840083Z else: 2025-05-07T20:32:44.2840183Z scale_ub_tensor = None 2025-05-07T20:32:44.2840262Z 2025-05-07T20:32:44.2840396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2840499Z op = silu_mul_quant 2025-05-07T20:32:44.2840586Z if compiled: 2025-05-07T20:32:44.2840691Z op = torch.compile(op) 2025-05-07T20:32:44.2840805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2840880Z 2025-05-07T20:32:44.2840973Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2841101Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2841175Z 2025-05-07T20:32:44.2841317Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2841420Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2841522Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2841655Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2841793Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2841872Z 2025-05-07T20:32:44.2842067Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.2842072Z 2025-05-07T20:32:44.2842175Z moe/activation_test.py:126: 2025-05-07T20:32:44.2842311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2842494Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.2842630Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2843195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.2843298Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.2843659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2843893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2844263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.2844526Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.2844898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.2845070Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.2845416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.2845495Z fn() 2025-05-07T20:32:44.2845893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.2845986Z self.fn.run( 2025-05-07T20:32:44.2846323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2846423Z kernel = self.compile( 2025-05-07T20:32:44.2846807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2846981Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2847118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2847127Z 2025-05-07T20:32:44.2847335Z self = 2025-05-07T20:32:44.2848114Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2848617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea1553a0>} 2025-05-07T20:32:44.2849405Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2849612Z context = 2025-05-07T20:32:44.2849621Z 2025-05-07T20:32:44.2849787Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2850055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2850164Z module_map=module_map) 2025-05-07T20:32:44.2850326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2850437Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.2850515Z E ^ 2025-05-07T20:32:44.2850877Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2850881Z 2025-05-07T20:32:44.2851374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2851379Z 2025-05-07T20:32:44.2851486Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2851716Z self=, 2025-05-07T20:32:44.2851896Z T=128, 2025-05-07T20:32:44.2851975Z D=5120, 2025-05-07T20:32:44.2852065Z scale_ub=None, 2025-05-07T20:32:44.2852153Z contiguous=True, 2025-05-07T20:32:44.2852243Z compiled=True, 2025-05-07T20:32:44.2852319Z ) 2025-05-07T20:32:44.2852535Z self = 2025-05-07T20:32:44.2852709Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.2852713Z 2025-05-07T20:32:44.2852792Z @given( 2025-05-07T20:32:44.2852913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2853018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2853134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2853256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2853378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2853455Z ) 2025-05-07T20:32:44.2853710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2853805Z def test_silu_mul_quant( 2025-05-07T20:32:44.2853884Z self, 2025-05-07T20:32:44.2853971Z T: int, 2025-05-07T20:32:44.2854049Z D: int, 2025-05-07T20:32:44.2854148Z scale_ub: Optional[float], 2025-05-07T20:32:44.2854245Z contiguous: bool, 2025-05-07T20:32:44.2854333Z compiled: bool, 2025-05-07T20:32:44.2854412Z ) -> None: 2025-05-07T20:32:44.2854518Z torch.manual_seed(2025) 2025-05-07T20:32:44.2854595Z 2025-05-07T20:32:44.2854763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2854846Z 2025-05-07T20:32:44.2854939Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2855073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2855168Z x = x_sign * x_clamp 2025-05-07T20:32:44.2855255Z x0 = x[:, :D] 2025-05-07T20:32:44.2855342Z x1 = x[:, D:] 2025-05-07T20:32:44.2855422Z 2025-05-07T20:32:44.2855507Z if contiguous: 2025-05-07T20:32:44.2855605Z x0 = x0.contiguous() 2025-05-07T20:32:44.2855694Z x1 = x1.contiguous() 2025-05-07T20:32:44.2855769Z 2025-05-07T20:32:44.2855872Z if scale_ub is not None: 2025-05-07T20:32:44.2855979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2856115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2856199Z ) 2025-05-07T20:32:44.2856281Z else: 2025-05-07T20:32:44.2856379Z scale_ub_tensor = None 2025-05-07T20:32:44.2856460Z 2025-05-07T20:32:44.2856593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2856686Z op = silu_mul_quant 2025-05-07T20:32:44.2856784Z if compiled: 2025-05-07T20:32:44.2856885Z op = torch.compile(op) 2025-05-07T20:32:44.2857001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2857080Z 2025-05-07T20:32:44.2857172Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.2857298Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.2857374Z 2025-05-07T20:32:44.2857510Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2857616Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.2857717Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.2857838Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.2857983Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.2858061Z 2025-05-07T20:32:44.2858163Z > y_fp8_ref, 
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

ref_fn() raises the identical CompilationError out of _kernel_quantize_fp8_row (traceback as above). The test body is the same for every retry, so from here on only the drawn parameters and the failure point are listed:

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None, contiguous=True, compiled=True)   -> fails in ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)   -> fails in ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row)
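The autotuner.py frames in the traceback above show why the error surfaces inside do_bench() rather than at import time: _kernel_quantize_fp8_row is an autotuned kernel, and Triton compiles each candidate config lazily, on the first launch, while benchmarking it. A toy sketch of that mechanism (hypothetical kernel and configs, not FBGEMM's):

import triton
import triton.language as tl

# Each config below is compiled and timed on the kernel's first launch; a
# kernel body that uses a dtype the GPU cannot encode therefore fails inside
# the autotuner's _bench() call, exactly as in the log above.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],
)
@triton.jit
def _double_rows(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)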
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

This example fails one step earlier, in fn() itself, while compiling the fbgemm_gpu kernel under torch.compile:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)    -> fails in ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)    -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant, called eagerly, so no torch._dynamo frame)
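For reference, what both failing kernels ultimately compute is ordinary row-wise FP8 quantization. Below is a rough eager-mode equivalent, assuming per-row symmetric scaling into the E4M3 range and a PyTorch recent enough to ship torch.float8_e4m3fn; the returned scale is the dequantization scale, matching the test's y_fp8.to(torch.float32) * y_scale[:, None]. This is a sketch of the idea, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row symmetric quantization: choose a scale so each row's max |value|
    # maps to the E4M3 maximum (448.0), optionally capping the row max first
    # (our assumption about how scale_ub is applied).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Return the dequantization scale: y ~= y_fp8.to(float32) * scale[:, None].
    return y_fp8, scale.squeeze(-1)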
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True)   -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False)  -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)  -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant)
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)   -> fails in fn() at moe/activation_test.py:117 (_fbgemm_silu_mul_quant); the captured log cuts off inside this final CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3016985Z 2025-05-07T20:32:44.3017402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3017411Z 2025-05-07T20:32:44.3017515Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3017739Z self=, 2025-05-07T20:32:44.3017823Z T=1, 2025-05-07T20:32:44.3017903Z D=7168, 2025-05-07T20:32:44.3017994Z scale_ub=1200.0, 2025-05-07T20:32:44.3018083Z contiguous=False, 2025-05-07T20:32:44.3018168Z compiled=True, 2025-05-07T20:32:44.3018251Z ) 2025-05-07T20:32:44.3018471Z self = 2025-05-07T20:32:44.3018644Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3018649Z 2025-05-07T20:32:44.3018736Z @given( 2025-05-07T20:32:44.3018858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3018966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3019089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3019207Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3019326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3019403Z ) 2025-05-07T20:32:44.3019649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3019748Z def test_silu_mul_quant( 2025-05-07T20:32:44.3019825Z self, 2025-05-07T20:32:44.3019904Z T: int, 2025-05-07T20:32:44.3019989Z D: int, 2025-05-07T20:32:44.3020090Z scale_ub: Optional[float], 2025-05-07T20:32:44.3020182Z contiguous: bool, 2025-05-07T20:32:44.3020277Z compiled: bool, 2025-05-07T20:32:44.3020360Z ) -> None: 2025-05-07T20:32:44.3020465Z torch.manual_seed(2025) 2025-05-07T20:32:44.3020542Z 2025-05-07T20:32:44.3020712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3020797Z 2025-05-07T20:32:44.3020891Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3021019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3021123Z x = x_sign * x_clamp 2025-05-07T20:32:44.3021206Z x0 = x[:, :D] 2025-05-07T20:32:44.3021288Z x1 = x[:, D:] 2025-05-07T20:32:44.3021372Z 2025-05-07T20:32:44.3021459Z if contiguous: 2025-05-07T20:32:44.3021555Z x0 = x0.contiguous() 2025-05-07T20:32:44.3021653Z x1 = x1.contiguous() 2025-05-07T20:32:44.3021729Z 2025-05-07T20:32:44.3021823Z if scale_ub is not None: 2025-05-07T20:32:44.3021937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3022073Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3022159Z ) 2025-05-07T20:32:44.3022240Z else: 2025-05-07T20:32:44.3022338Z scale_ub_tensor = None 2025-05-07T20:32:44.3022417Z 2025-05-07T20:32:44.3022548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3022646Z op = silu_mul_quant 2025-05-07T20:32:44.3022740Z if compiled: 2025-05-07T20:32:44.3022841Z op = torch.compile(op) 2025-05-07T20:32:44.3022949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3023030Z 2025-05-07T20:32:44.3023122Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3023126Z 2025-05-07T20:32:44.3023235Z moe/activation_test.py:117: 2025-05-07T20:32:44.3023369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3023472Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3023578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3024035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3024134Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3024634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3024810Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3025173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3025397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3025736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3025836Z kernel = self.compile( 2025-05-07T20:32:44.3026218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3026397Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3026530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3026535Z 2025-05-07T20:32:44.3026741Z self = 2025-05-07T20:32:44.3027534Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3028038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8bdd0e00>} 2025-05-07T20:32:44.3028786Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3028981Z context = 2025-05-07T20:32:44.3028985Z 2025-05-07T20:32:44.3029153Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3029420Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3029535Z module_map=module_map) 2025-05-07T20:32:44.3029704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3029810Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3029890Z E ^ 2025-05-07T20:32:44.3030247Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3030252Z 2025-05-07T20:32:44.3030664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3030669Z 2025-05-07T20:32:44.3030777Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3031010Z self=, 2025-05-07T20:32:44.3031089Z T=1, 2025-05-07T20:32:44.3031172Z D=7168, 2025-05-07T20:32:44.3031257Z scale_ub=None, 2025-05-07T20:32:44.3031349Z contiguous=False, 2025-05-07T20:32:44.3031439Z compiled=True, 2025-05-07T20:32:44.3031514Z ) 2025-05-07T20:32:44.3031733Z self = 2025-05-07T20:32:44.3031905Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3031910Z 2025-05-07T20:32:44.3031988Z @given( 2025-05-07T20:32:44.3032109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3032219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3032336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3032459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3032575Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3032732Z ) 2025-05-07T20:32:44.3032983Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3033079Z def test_silu_mul_quant( 2025-05-07T20:32:44.3033237Z self, 2025-05-07T20:32:44.3033321Z T: int, 2025-05-07T20:32:44.3033401Z D: int, 2025-05-07T20:32:44.3033503Z scale_ub: Optional[float], 2025-05-07T20:32:44.3033601Z contiguous: bool, 2025-05-07T20:32:44.3033689Z compiled: bool, 2025-05-07T20:32:44.3033771Z ) -> None: 2025-05-07T20:32:44.3033874Z torch.manual_seed(2025) 2025-05-07T20:32:44.3033950Z 2025-05-07T20:32:44.3034126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3034202Z 2025-05-07T20:32:44.3034296Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3034425Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3034515Z x = x_sign * x_clamp 2025-05-07T20:32:44.3034597Z x0 = x[:, :D] 2025-05-07T20:32:44.3034691Z x1 = x[:, D:] 2025-05-07T20:32:44.3034766Z 2025-05-07T20:32:44.3034853Z if contiguous: 2025-05-07T20:32:44.3034953Z x0 = x0.contiguous() 2025-05-07T20:32:44.3035051Z x1 = x1.contiguous() 2025-05-07T20:32:44.3035127Z 2025-05-07T20:32:44.3035228Z if scale_ub is not None: 2025-05-07T20:32:44.3035340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3035481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3035558Z ) 2025-05-07T20:32:44.3035635Z else: 2025-05-07T20:32:44.3035803Z scale_ub_tensor = None 2025-05-07T20:32:44.3035880Z 2025-05-07T20:32:44.3036010Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3036108Z op = silu_mul_quant 2025-05-07T20:32:44.3036193Z if compiled: 2025-05-07T20:32:44.3036296Z op = torch.compile(op) 2025-05-07T20:32:44.3036412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3036488Z 2025-05-07T20:32:44.3036581Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.3036711Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.3036790Z 2025-05-07T20:32:44.3036928Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3037034Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.3037134Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.3037264Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.3037403Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.3037479Z 2025-05-07T20:32:44.3037584Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.3037588Z 2025-05-07T20:32:44.3037689Z moe/activation_test.py:126: 2025-05-07T20:32:44.3037820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3037931Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.3038073Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.3038633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.3038740Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.3039102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3039331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3039696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.3039956Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.3040328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.3040580Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.3040928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.3041088Z fn() 2025-05-07T20:32:44.3041491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.3041579Z self.fn.run( 2025-05-07T20:32:44.3041915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3042013Z kernel = self.compile( 2025-05-07T20:32:44.3042394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3042570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3042703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3042707Z 2025-05-07T20:32:44.3042920Z self = 2025-05-07T20:32:44.3043705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3044212Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be858e8e0>} 2025-05-07T20:32:44.3044958Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3045156Z context = 2025-05-07T20:32:44.3045160Z 2025-05-07T20:32:44.3045333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3045601Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3045712Z module_map=module_map) 2025-05-07T20:32:44.3045883Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3045992Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.3046071Z E ^ 2025-05-07T20:32:44.3046426Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3046434Z 2025-05-07T20:32:44.3046844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3046848Z 2025-05-07T20:32:44.3046952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3047178Z self=, 2025-05-07T20:32:44.3047257Z T=1, 2025-05-07T20:32:44.3047335Z D=5120, 2025-05-07T20:32:44.3047427Z scale_ub=1200.0, 2025-05-07T20:32:44.3047516Z contiguous=False, 2025-05-07T20:32:44.3047600Z compiled=True, 2025-05-07T20:32:44.3047679Z ) 2025-05-07T20:32:44.3047902Z self = 2025-05-07T20:32:44.3048072Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3048076Z 2025-05-07T20:32:44.3048153Z @given( 2025-05-07T20:32:44.3048273Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3048377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3048492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3048611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3048728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3048803Z ) 2025-05-07T20:32:44.3049049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3049252Z def test_silu_mul_quant( 2025-05-07T20:32:44.3049330Z self, 2025-05-07T20:32:44.3049411Z T: int, 2025-05-07T20:32:44.3049489Z D: int, 2025-05-07T20:32:44.3049588Z scale_ub: Optional[float], 2025-05-07T20:32:44.3049754Z contiguous: bool, 2025-05-07T20:32:44.3049847Z compiled: bool, 2025-05-07T20:32:44.3049926Z ) -> None: 2025-05-07T20:32:44.3050027Z torch.manual_seed(2025) 2025-05-07T20:32:44.3050102Z 2025-05-07T20:32:44.3050269Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3050348Z 2025-05-07T20:32:44.3050441Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3050564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3050657Z x = x_sign * x_clamp 2025-05-07T20:32:44.3050739Z x0 = x[:, :D] 2025-05-07T20:32:44.3050824Z x1 = x[:, D:] 2025-05-07T20:32:44.3050899Z 2025-05-07T20:32:44.3050985Z if contiguous: 2025-05-07T20:32:44.3051088Z x0 = x0.contiguous() 2025-05-07T20:32:44.3051179Z x1 = x1.contiguous() 2025-05-07T20:32:44.3051253Z 2025-05-07T20:32:44.3051349Z if scale_ub is not None: 2025-05-07T20:32:44.3051462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3051597Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3051682Z ) 2025-05-07T20:32:44.3051759Z else: 2025-05-07T20:32:44.3051854Z scale_ub_tensor = None 2025-05-07T20:32:44.3051930Z 2025-05-07T20:32:44.3052063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3052154Z op = silu_mul_quant 2025-05-07T20:32:44.3052244Z if compiled: 2025-05-07T20:32:44.3052343Z op = torch.compile(op) 2025-05-07T20:32:44.3052457Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3052530Z 2025-05-07T20:32:44.3052623Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3052628Z 2025-05-07T20:32:44.3052734Z moe/activation_test.py:117: 2025-05-07T20:32:44.3052862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3052967Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3053076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3053448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3053541Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3054032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3054133Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3054489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3054713Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3055060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3055156Z kernel = self.compile( 2025-05-07T20:32:44.3055539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3055716Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3055847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3055852Z 2025-05-07T20:32:44.3056054Z self = 2025-05-07T20:32:44.3056829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3057421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be858c540>} 2025-05-07T20:32:44.3058167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3058438Z context = 2025-05-07T20:32:44.3058443Z 2025-05-07T20:32:44.3058607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3058877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3058986Z module_map=module_map) 2025-05-07T20:32:44.3059147Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3059250Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3059330Z E ^ 2025-05-07T20:32:44.3059690Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3059695Z 2025-05-07T20:32:44.3060111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3060123Z 2025-05-07T20:32:44.3060229Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3060457Z self=, 2025-05-07T20:32:44.3060539Z T=1, 2025-05-07T20:32:44.3060619Z D=5120, 2025-05-07T20:32:44.3060709Z scale_ub=1200.0, 2025-05-07T20:32:44.3060800Z contiguous=False, 2025-05-07T20:32:44.3060886Z compiled=False, 2025-05-07T20:32:44.3060965Z ) 2025-05-07T20:32:44.3061184Z self = 2025-05-07T20:32:44.3061352Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3061356Z 2025-05-07T20:32:44.3061438Z @given( 2025-05-07T20:32:44.3061563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3061673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3061791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3061915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3062038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3062114Z ) 2025-05-07T20:32:44.3062359Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3062457Z def test_silu_mul_quant( 2025-05-07T20:32:44.3062538Z self, 2025-05-07T20:32:44.3062619Z T: int, 2025-05-07T20:32:44.3062701Z D: int, 2025-05-07T20:32:44.3062801Z scale_ub: Optional[float], 2025-05-07T20:32:44.3062897Z contiguous: bool, 2025-05-07T20:32:44.3062984Z compiled: bool, 2025-05-07T20:32:44.3063067Z ) -> None: 2025-05-07T20:32:44.3063167Z torch.manual_seed(2025) 2025-05-07T20:32:44.3063242Z 2025-05-07T20:32:44.3063417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3063501Z 2025-05-07T20:32:44.3063594Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3063726Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3063820Z x = x_sign * x_clamp 2025-05-07T20:32:44.3063900Z x0 = x[:, :D] 2025-05-07T20:32:44.3063981Z x1 = x[:, D:] 2025-05-07T20:32:44.3064060Z 2025-05-07T20:32:44.3064147Z if contiguous: 2025-05-07T20:32:44.3064239Z x0 = x0.contiguous() 2025-05-07T20:32:44.3064331Z x1 = x1.contiguous() 2025-05-07T20:32:44.3064407Z 2025-05-07T20:32:44.3064502Z if scale_ub is not None: 2025-05-07T20:32:44.3064612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3064747Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3064827Z ) 2025-05-07T20:32:44.3064904Z else: 2025-05-07T20:32:44.3065082Z scale_ub_tensor = None 2025-05-07T20:32:44.3065160Z 2025-05-07T20:32:44.3065292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3065583Z op = silu_mul_quant 2025-05-07T20:32:44.3065815Z if compiled: 2025-05-07T20:32:44.3065925Z op = torch.compile(op) 2025-05-07T20:32:44.3066029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3066106Z 2025-05-07T20:32:44.3066196Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3066201Z 2025-05-07T20:32:44.3066302Z moe/activation_test.py:117: 2025-05-07T20:32:44.3066429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3066529Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3066632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3067127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3067228Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3067587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3067812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3068153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3068244Z kernel = self.compile( 2025-05-07T20:32:44.3068625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3068802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3068926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3068930Z 2025-05-07T20:32:44.3069139Z self = 2025-05-07T20:32:44.3069918Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3070422Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8b5e0c0>} 2025-05-07T20:32:44.3071168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3071356Z context = 2025-05-07T20:32:44.3071361Z 2025-05-07T20:32:44.3071528Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3071788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3071900Z module_map=module_map) 2025-05-07T20:32:44.3072071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3072172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3072260Z E ^ 2025-05-07T20:32:44.3072613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3072618Z 2025-05-07T20:32:44.3073025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3073030Z 2025-05-07T20:32:44.3073137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3073361Z self=, 2025-05-07T20:32:44.3073440Z T=16384, 2025-05-07T20:32:44.3073523Z D=5120, 2025-05-07T20:32:44.3073608Z scale_ub=1200.0, 2025-05-07T20:32:44.3073699Z contiguous=False, 2025-05-07T20:32:44.3073926Z compiled=True, 2025-05-07T20:32:44.3074005Z ) 2025-05-07T20:32:44.3074229Z self = 2025-05-07T20:32:44.3074407Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3074546Z 2025-05-07T20:32:44.3074625Z @given( 2025-05-07T20:32:44.3074748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3074847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3074962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3075080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3075194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3075272Z ) 2025-05-07T20:32:44.3075515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3075609Z def test_silu_mul_quant( 2025-05-07T20:32:44.3075693Z self, 2025-05-07T20:32:44.3075823Z T: int, 2025-05-07T20:32:44.3075906Z D: int, 2025-05-07T20:32:44.3076009Z scale_ub: Optional[float], 2025-05-07T20:32:44.3076099Z contiguous: bool, 2025-05-07T20:32:44.3076186Z compiled: bool, 2025-05-07T20:32:44.3076276Z ) -> None: 2025-05-07T20:32:44.3076376Z torch.manual_seed(2025) 2025-05-07T20:32:44.3076450Z 2025-05-07T20:32:44.3076621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3076699Z 2025-05-07T20:32:44.3076797Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3076920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3077008Z x = x_sign * x_clamp 2025-05-07T20:32:44.3077098Z x0 = x[:, :D] 2025-05-07T20:32:44.3077180Z x1 = x[:, D:] 2025-05-07T20:32:44.3077254Z 2025-05-07T20:32:44.3077340Z if contiguous: 2025-05-07T20:32:44.3077432Z x0 = x0.contiguous() 2025-05-07T20:32:44.3077522Z x1 = x1.contiguous() 2025-05-07T20:32:44.3077602Z 2025-05-07T20:32:44.3077694Z if scale_ub is not None: 2025-05-07T20:32:44.3077801Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3077940Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3078024Z ) 2025-05-07T20:32:44.3078106Z else: 2025-05-07T20:32:44.3078203Z scale_ub_tensor = None 2025-05-07T20:32:44.3078277Z 2025-05-07T20:32:44.3078408Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3078500Z op = silu_mul_quant 2025-05-07T20:32:44.3078586Z if compiled: 2025-05-07T20:32:44.3078689Z op = torch.compile(op) 2025-05-07T20:32:44.3078795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3078870Z 2025-05-07T20:32:44.3078964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3078968Z 2025-05-07T20:32:44.3079066Z moe/activation_test.py:117: 2025-05-07T20:32:44.3079200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3079307Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3079407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3079776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3079875Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3080364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3080466Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3080820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3081040Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3081379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3081601Z kernel = self.compile( 2025-05-07T20:32:44.3081985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3082158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3082362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3082367Z 2025-05-07T20:32:44.3082573Z self = 2025-05-07T20:32:44.3083346Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3083850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be90eb560>} 2025-05-07T20:32:44.3084598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3084794Z context = 2025-05-07T20:32:44.3084799Z 2025-05-07T20:32:44.3084962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3085224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3085335Z module_map=module_map) 2025-05-07T20:32:44.3085497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3085597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3085681Z E ^ 2025-05-07T20:32:44.3086035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3086040Z 2025-05-07T20:32:44.3086455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3086460Z 2025-05-07T20:32:44.3086565Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3086793Z self=, 2025-05-07T20:32:44.3086873Z T=2048, 2025-05-07T20:32:44.3086949Z D=7168, 2025-05-07T20:32:44.3087032Z scale_ub=1200.0, 2025-05-07T20:32:44.3087123Z contiguous=False, 2025-05-07T20:32:44.3087206Z compiled=True, 2025-05-07T20:32:44.3087283Z ) 2025-05-07T20:32:44.3087498Z self = 2025-05-07T20:32:44.3087672Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3087677Z 2025-05-07T20:32:44.3087757Z @given( 2025-05-07T20:32:44.3087874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3087975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3088095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3088211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3088326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3088405Z ) 2025-05-07T20:32:44.3088647Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3088743Z def test_silu_mul_quant( 2025-05-07T20:32:44.3088820Z self, 2025-05-07T20:32:44.3088896Z T: int, 2025-05-07T20:32:44.3088975Z D: int, 2025-05-07T20:32:44.3089072Z scale_ub: Optional[float], 2025-05-07T20:32:44.3089159Z contiguous: bool, 2025-05-07T20:32:44.3089248Z compiled: bool, 2025-05-07T20:32:44.3089326Z ) -> None: 2025-05-07T20:32:44.3089421Z torch.manual_seed(2025) 2025-05-07T20:32:44.3089496Z 2025-05-07T20:32:44.3089661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3089820Z 2025-05-07T20:32:44.3089915Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3090039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3090133Z x = x_sign * x_clamp 2025-05-07T20:32:44.3090287Z x0 = x[:, :D] 2025-05-07T20:32:44.3090366Z x1 = x[:, D:] 2025-05-07T20:32:44.3090442Z 2025-05-07T20:32:44.3090524Z if contiguous: 2025-05-07T20:32:44.3090617Z x0 = x0.contiguous() 2025-05-07T20:32:44.3090708Z x1 = x1.contiguous() 2025-05-07T20:32:44.3090780Z 2025-05-07T20:32:44.3090870Z if scale_ub is not None: 2025-05-07T20:32:44.3090979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3091110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3091184Z ) 2025-05-07T20:32:44.3091264Z else: 2025-05-07T20:32:44.3091358Z scale_ub_tensor = None 2025-05-07T20:32:44.3091431Z 2025-05-07T20:32:44.3091569Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3091662Z op = silu_mul_quant 2025-05-07T20:32:44.3091750Z if compiled: 2025-05-07T20:32:44.3091848Z op = torch.compile(op) 2025-05-07T20:32:44.3091959Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3092034Z 2025-05-07T20:32:44.3092127Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3092132Z 2025-05-07T20:32:44.3092228Z moe/activation_test.py:117: 2025-05-07T20:32:44.3092359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3092459Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3092559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3092927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3093018Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3093513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3093610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3093964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3094191Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3094526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3094620Z kernel = self.compile( 2025-05-07T20:32:44.3094997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3095170Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3095299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3095303Z 2025-05-07T20:32:44.3095511Z self = 2025-05-07T20:32:44.3096294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3096800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be8d6fd80>} 2025-05-07T20:32:44.3097543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3097736Z context = 2025-05-07T20:32:44.3097740Z 2025-05-07T20:32:44.3097902Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3098247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3098357Z module_map=module_map) 2025-05-07T20:32:44.3098517Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3098694Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3098771Z E ^ 2025-05-07T20:32:44.3099122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3099130Z 2025-05-07T20:32:44.3099540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3099544Z 2025-05-07T20:32:44.3099647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3099871Z self=, 2025-05-07T20:32:44.3099948Z T=1, 2025-05-07T20:32:44.3100023Z D=5120, 2025-05-07T20:32:44.3100110Z scale_ub=None, 2025-05-07T20:32:44.3100201Z contiguous=False, 2025-05-07T20:32:44.3100285Z compiled=False, 2025-05-07T20:32:44.3100363Z ) 2025-05-07T20:32:44.3100579Z self = 2025-05-07T20:32:44.3100752Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3100757Z 2025-05-07T20:32:44.3100836Z @given( 2025-05-07T20:32:44.3100954Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3101058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3101173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3101288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3101406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3101480Z ) 2025-05-07T20:32:44.3101721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3101817Z def test_silu_mul_quant( 2025-05-07T20:32:44.3101897Z self, 2025-05-07T20:32:44.3101977Z T: int, 2025-05-07T20:32:44.3102053Z D: int, 2025-05-07T20:32:44.3102152Z scale_ub: Optional[float], 2025-05-07T20:32:44.3102244Z contiguous: bool, 2025-05-07T20:32:44.3102334Z compiled: bool, 2025-05-07T20:32:44.3102411Z ) -> None: 2025-05-07T20:32:44.3102509Z torch.manual_seed(2025) 2025-05-07T20:32:44.3102582Z 2025-05-07T20:32:44.3102747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3102825Z 2025-05-07T20:32:44.3102914Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3103039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3103132Z x = x_sign * x_clamp 2025-05-07T20:32:44.3103212Z x0 = x[:, :D] 2025-05-07T20:32:44.3103296Z x1 = x[:, D:] 2025-05-07T20:32:44.3103369Z 2025-05-07T20:32:44.3103453Z if contiguous: 2025-05-07T20:32:44.3103546Z x0 = x0.contiguous() 2025-05-07T20:32:44.3103641Z x1 = x1.contiguous() 2025-05-07T20:32:44.3103713Z 2025-05-07T20:32:44.3103806Z if scale_ub is not None: 2025-05-07T20:32:44.3103910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3104049Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3104128Z ) 2025-05-07T20:32:44.3104205Z else: 2025-05-07T20:32:44.3104299Z scale_ub_tensor = None 2025-05-07T20:32:44.3104377Z 2025-05-07T20:32:44.3104503Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3104591Z op = silu_mul_quant 2025-05-07T20:32:44.3104679Z if compiled: 2025-05-07T20:32:44.3104777Z op = torch.compile(op) 2025-05-07T20:32:44.3104887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3104960Z 2025-05-07T20:32:44.3105049Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3105053Z 2025-05-07T20:32:44.3105151Z moe/activation_test.py:117: 2025-05-07T20:32:44.3105361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3105462Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3105562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3106156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3106254Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3106610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3106830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3107170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3107263Z kernel = self.compile( 2025-05-07T20:32:44.3107647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3107821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3107950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3107959Z 2025-05-07T20:32:44.3108163Z self = 2025-05-07T20:32:44.3108941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3109490Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be976bd80>} 2025-05-07T20:32:44.3110242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3110431Z context = 2025-05-07T20:32:44.3110435Z 2025-05-07T20:32:44.3110607Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3110867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3110973Z module_map=module_map) 2025-05-07T20:32:44.3111137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3111237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3111317Z E ^ 2025-05-07T20:32:44.3111668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3111673Z 2025-05-07T20:32:44.3112082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3112091Z 2025-05-07T20:32:44.3112199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3112421Z self=, 2025-05-07T20:32:44.3112503Z T=4096, 2025-05-07T20:32:44.3112582Z D=7168, 2025-05-07T20:32:44.3112664Z scale_ub=1200.0, 2025-05-07T20:32:44.3112755Z contiguous=False, 2025-05-07T20:32:44.3112838Z compiled=False, 2025-05-07T20:32:44.3112911Z ) 2025-05-07T20:32:44.3113130Z self = 2025-05-07T20:32:44.3113306Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3113310Z 2025-05-07T20:32:44.3113386Z @given( 2025-05-07T20:32:44.3113507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3113605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3113719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3113916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3114031Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3114108Z ) 2025-05-07T20:32:44.3114349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3114518Z def test_silu_mul_quant( 2025-05-07T20:32:44.3114598Z self, 2025-05-07T20:32:44.3114674Z T: int, 2025-05-07T20:32:44.3114749Z D: int, 2025-05-07T20:32:44.3114852Z scale_ub: Optional[float], 2025-05-07T20:32:44.3114940Z contiguous: bool, 2025-05-07T20:32:44.3115024Z compiled: bool, 2025-05-07T20:32:44.3115106Z ) -> None: 2025-05-07T20:32:44.3115201Z torch.manual_seed(2025) 2025-05-07T20:32:44.3115274Z 2025-05-07T20:32:44.3115443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3115516Z 2025-05-07T20:32:44.3115611Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3115784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3115873Z x = x_sign * x_clamp 2025-05-07T20:32:44.3115954Z x0 = x[:, :D] 2025-05-07T20:32:44.3116032Z x1 = x[:, D:] 2025-05-07T20:32:44.3116111Z 2025-05-07T20:32:44.3116196Z if contiguous: 2025-05-07T20:32:44.3116286Z x0 = x0.contiguous() 2025-05-07T20:32:44.3116374Z x1 = x1.contiguous() 2025-05-07T20:32:44.3116449Z 2025-05-07T20:32:44.3116539Z if scale_ub is not None: 2025-05-07T20:32:44.3116643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3116779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3116854Z ) 2025-05-07T20:32:44.3116932Z else: 2025-05-07T20:32:44.3117026Z scale_ub_tensor = None 2025-05-07T20:32:44.3117100Z 2025-05-07T20:32:44.3117233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3117322Z op = silu_mul_quant 2025-05-07T20:32:44.3117411Z if compiled: 2025-05-07T20:32:44.3117513Z op = torch.compile(op) 2025-05-07T20:32:44.3117619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3117691Z 2025-05-07T20:32:44.3117789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3117793Z 2025-05-07T20:32:44.3117888Z moe/activation_test.py:117: 2025-05-07T20:32:44.3118014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3118116Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3118216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3118713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3118807Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3119161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3119392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3119731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3119826Z kernel = self.compile( 2025-05-07T20:32:44.3120205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3120379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3120506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3120510Z 2025-05-07T20:32:44.3120712Z self = 2025-05-07T20:32:44.3121484Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3122067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5be93a0e00>} 2025-05-07T20:32:44.3122810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3123075Z context = 2025-05-07T20:32:44.3123080Z 2025-05-07T20:32:44.3123243Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3123505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3123612Z module_map=module_map) 2025-05-07T20:32:44.3123773Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3123878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3123961Z E ^ 2025-05-07T20:32:44.3124313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3124318Z 2025-05-07T20:32:44.3124733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3124738Z 2025-05-07T20:32:44.3124840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3125064Z self=, 2025-05-07T20:32:44.3125143Z T=16384, 2025-05-07T20:32:44.3125221Z D=7168, 2025-05-07T20:32:44.3125308Z scale_ub=None, 2025-05-07T20:32:44.3125393Z contiguous=True, 2025-05-07T20:32:44.3125478Z compiled=True, 2025-05-07T20:32:44.3125555Z ) 2025-05-07T20:32:44.3125772Z self = 2025-05-07T20:32:44.3125946Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3125955Z 2025-05-07T20:32:44.3126034Z @given( 2025-05-07T20:32:44.3126154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3126256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3126377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3126494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3126610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3126686Z ) 2025-05-07T20:32:44.3126927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3127024Z def test_silu_mul_quant( 2025-05-07T20:32:44.3127101Z self, 2025-05-07T20:32:44.3127185Z T: int, 2025-05-07T20:32:44.3127262Z D: int, 2025-05-07T20:32:44.3127360Z scale_ub: Optional[float], 2025-05-07T20:32:44.3127453Z contiguous: bool, 2025-05-07T20:32:44.3127538Z compiled: bool, 2025-05-07T20:32:44.3127617Z ) -> None: 2025-05-07T20:32:44.3127724Z torch.manual_seed(2025) 2025-05-07T20:32:44.3127799Z 2025-05-07T20:32:44.3127966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3128049Z 2025-05-07T20:32:44.3128142Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3128266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3128364Z x = x_sign * x_clamp 2025-05-07T20:32:44.3128446Z x0 = x[:, :D] 2025-05-07T20:32:44.3128527Z x1 = x[:, D:] 2025-05-07T20:32:44.3128607Z 2025-05-07T20:32:44.3128692Z if contiguous: 2025-05-07T20:32:44.3128786Z x0 = x0.contiguous() 2025-05-07T20:32:44.3128875Z x1 = x1.contiguous() 2025-05-07T20:32:44.3128949Z 2025-05-07T20:32:44.3129043Z if scale_ub is not None: 2025-05-07T20:32:44.3129148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3129282Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3129471Z ) 2025-05-07T20:32:44.3129564Z else: 2025-05-07T20:32:44.3129674Z scale_ub_tensor = None 2025-05-07T20:32:44.3129751Z 2025-05-07T20:32:44.3129880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3130045Z op = silu_mul_quant 2025-05-07T20:32:44.3130134Z if compiled: 2025-05-07T20:32:44.3130239Z op = torch.compile(op) 2025-05-07T20:32:44.3130345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3130423Z 2025-05-07T20:32:44.3134193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3134200Z 2025-05-07T20:32:44.3134319Z moe/activation_test.py:117: 2025-05-07T20:32:44.3134451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3134559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3134664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3135051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3135151Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3135643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3135747Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3136106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3136328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3136669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3136764Z kernel = self.compile( 2025-05-07T20:32:44.3137146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3137324Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3137456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3137460Z 2025-05-07T20:32:44.3137671Z self = 2025-05-07T20:32:44.3138452Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3138956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4d8fe0>} 2025-05-07T20:32:44.3139709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3139906Z context = 2025-05-07T20:32:44.3139910Z 2025-05-07T20:32:44.3140080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3140344Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3140462Z module_map=module_map) 2025-05-07T20:32:44.3140627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3140730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3140808Z E ^ 2025-05-07T20:32:44.3141167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3141172Z 2025-05-07T20:32:44.3141583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3141588Z 2025-05-07T20:32:44.3141695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3142044Z self=, 2025-05-07T20:32:44.3142130Z T=4096, 2025-05-07T20:32:44.3142209Z D=5120, 2025-05-07T20:32:44.3142292Z scale_ub=None, 2025-05-07T20:32:44.3142457Z contiguous=False, 2025-05-07T20:32:44.3142540Z compiled=True, 2025-05-07T20:32:44.3142618Z ) 2025-05-07T20:32:44.3142840Z self = 2025-05-07T20:32:44.3143013Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3143018Z 2025-05-07T20:32:44.3143100Z @given( 2025-05-07T20:32:44.3143220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3143320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3143439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3143556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3143670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3143754Z ) 2025-05-07T20:32:44.3144379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3144477Z def test_silu_mul_quant( 2025-05-07T20:32:44.3144561Z self, 2025-05-07T20:32:44.3144640Z T: int, 2025-05-07T20:32:44.3144721Z D: int, 2025-05-07T20:32:44.3144820Z scale_ub: Optional[float], 2025-05-07T20:32:44.3144912Z contiguous: bool, 2025-05-07T20:32:44.3145004Z compiled: bool, 2025-05-07T20:32:44.3145086Z ) -> None: 2025-05-07T20:32:44.3145181Z torch.manual_seed(2025) 2025-05-07T20:32:44.3145262Z 2025-05-07T20:32:44.3145429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3145512Z 2025-05-07T20:32:44.3145605Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3145731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3145826Z x = x_sign * x_clamp 2025-05-07T20:32:44.3145910Z x0 = x[:, :D] 2025-05-07T20:32:44.3145995Z x1 = x[:, D:] 2025-05-07T20:32:44.3146071Z 2025-05-07T20:32:44.3146158Z if contiguous: 2025-05-07T20:32:44.3146251Z x0 = x0.contiguous() 2025-05-07T20:32:44.3146348Z x1 = x1.contiguous() 2025-05-07T20:32:44.3146422Z 2025-05-07T20:32:44.3146514Z if scale_ub is not None: 2025-05-07T20:32:44.3146630Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3146766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3146842Z ) 2025-05-07T20:32:44.3146926Z else: 2025-05-07T20:32:44.3147021Z scale_ub_tensor = None 2025-05-07T20:32:44.3147101Z 2025-05-07T20:32:44.3147231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3147325Z op = silu_mul_quant 2025-05-07T20:32:44.3147415Z if compiled: 2025-05-07T20:32:44.3147516Z op = torch.compile(op) 2025-05-07T20:32:44.3147626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3147706Z 2025-05-07T20:32:44.3147800Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3147805Z 2025-05-07T20:32:44.3147903Z moe/activation_test.py:117: 2025-05-07T20:32:44.3148040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3148143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3148246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3148618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3148712Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3149205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3149303Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3149658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3149971Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3150312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3150483Z kernel = self.compile( 2025-05-07T20:32:44.3150864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3151039Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3151171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3151176Z 2025-05-07T20:32:44.3151383Z self = 2025-05-07T20:32:44.3152168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3152672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4afb00>} 2025-05-07T20:32:44.3153419Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3153616Z context = 2025-05-07T20:32:44.3153620Z 2025-05-07T20:32:44.3153786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3154052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3154160Z module_map=module_map) 2025-05-07T20:32:44.3154321Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3154427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3154506Z E ^ 2025-05-07T20:32:44.3154864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3154873Z 2025-05-07T20:32:44.3155284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3155289Z 2025-05-07T20:32:44.3155393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3155621Z self=, 2025-05-07T20:32:44.3155780Z T=4096, 2025-05-07T20:32:44.3155860Z D=5120, 2025-05-07T20:32:44.3155947Z scale_ub=1200.0, 2025-05-07T20:32:44.3156034Z contiguous=False, 2025-05-07T20:32:44.3156123Z compiled=False, 2025-05-07T20:32:44.3156203Z ) 2025-05-07T20:32:44.3156420Z self = 2025-05-07T20:32:44.3156604Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3156609Z 2025-05-07T20:32:44.3156688Z @given( 2025-05-07T20:32:44.3156808Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3156916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3157032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3157149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3157266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3157342Z ) 2025-05-07T20:32:44.3157592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3157686Z def test_silu_mul_quant( 2025-05-07T20:32:44.3157765Z self, 2025-05-07T20:32:44.3157848Z T: int, 2025-05-07T20:32:44.3157926Z D: int, 2025-05-07T20:32:44.3158026Z scale_ub: Optional[float], 2025-05-07T20:32:44.3158120Z contiguous: bool, 2025-05-07T20:32:44.3158289Z compiled: bool, 2025-05-07T20:32:44.3158369Z ) -> None: 2025-05-07T20:32:44.3158470Z torch.manual_seed(2025) 2025-05-07T20:32:44.3158544Z 2025-05-07T20:32:44.3158716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3158868Z 2025-05-07T20:32:44.3158962Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3159094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3159188Z x = x_sign * x_clamp 2025-05-07T20:32:44.3159270Z x0 = x[:, :D] 2025-05-07T20:32:44.3159355Z x1 = x[:, D:] 2025-05-07T20:32:44.3159428Z 2025-05-07T20:32:44.3159512Z if contiguous: 2025-05-07T20:32:44.3159608Z x0 = x0.contiguous() 2025-05-07T20:32:44.3159697Z x1 = x1.contiguous() 2025-05-07T20:32:44.3159771Z 2025-05-07T20:32:44.3159866Z if scale_ub is not None: 2025-05-07T20:32:44.3159972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3160114Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3160196Z ) 2025-05-07T20:32:44.3160274Z else: 2025-05-07T20:32:44.3160371Z scale_ub_tensor = None 2025-05-07T20:32:44.3160455Z 2025-05-07T20:32:44.3160586Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3160679Z op = silu_mul_quant 2025-05-07T20:32:44.3160764Z if compiled: 2025-05-07T20:32:44.3160864Z op = torch.compile(op) 2025-05-07T20:32:44.3160975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3161048Z 2025-05-07T20:32:44.3161140Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3161144Z 2025-05-07T20:32:44.3161245Z moe/activation_test.py:117: 2025-05-07T20:32:44.3161374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3161475Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3161579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3162081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3162183Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3162547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3162770Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3163112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3163206Z kernel = self.compile( 2025-05-07T20:32:44.3163589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3163763Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3163894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3163902Z 2025-05-07T20:32:44.3164110Z self = 2025-05-07T20:32:44.3164889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3165759Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea4ad3a0>} 2025-05-07T20:32:44.3166516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3166707Z context = 2025-05-07T20:32:44.3166713Z 2025-05-07T20:32:44.3167028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3167296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3167409Z module_map=module_map) 2025-05-07T20:32:44.3167683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3167785Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3167869Z E ^ 2025-05-07T20:32:44.3168223Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3168228Z 2025-05-07T20:32:44.3168639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3168647Z 2025-05-07T20:32:44.3168751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3168977Z self=, 2025-05-07T20:32:44.3169063Z T=4096, 2025-05-07T20:32:44.3169145Z D=5120, 2025-05-07T20:32:44.3169232Z scale_ub=1200.0, 2025-05-07T20:32:44.3169323Z contiguous=False, 2025-05-07T20:32:44.3169406Z compiled=True, 2025-05-07T20:32:44.3169481Z ) 2025-05-07T20:32:44.3169707Z self = 2025-05-07T20:32:44.3169882Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3169887Z 2025-05-07T20:32:44.3169968Z @given( 2025-05-07T20:32:44.3170089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3170190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3170312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3170429Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3170542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3170620Z ) 2025-05-07T20:32:44.3170863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3170961Z def test_silu_mul_quant( 2025-05-07T20:32:44.3171040Z self, 2025-05-07T20:32:44.3171118Z T: int, 2025-05-07T20:32:44.3171197Z D: int, 2025-05-07T20:32:44.3171307Z scale_ub: Optional[float], 2025-05-07T20:32:44.3171396Z contiguous: bool, 2025-05-07T20:32:44.3171485Z compiled: bool, 2025-05-07T20:32:44.3171564Z ) -> None: 2025-05-07T20:32:44.3171664Z torch.manual_seed(2025) 2025-05-07T20:32:44.3171741Z 2025-05-07T20:32:44.3171909Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3171984Z 2025-05-07T20:32:44.3172080Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3172203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3172290Z x = x_sign * x_clamp 2025-05-07T20:32:44.3172372Z x0 = x[:, :D] 2025-05-07T20:32:44.3172451Z x1 = x[:, D:] 2025-05-07T20:32:44.3172525Z 2025-05-07T20:32:44.3172617Z if contiguous: 2025-05-07T20:32:44.3172708Z x0 = x0.contiguous() 2025-05-07T20:32:44.3172797Z x1 = x1.contiguous() 2025-05-07T20:32:44.3172870Z 2025-05-07T20:32:44.3172959Z if scale_ub is not None: 2025-05-07T20:32:44.3173071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3173205Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3173280Z ) 2025-05-07T20:32:44.3173360Z else: 2025-05-07T20:32:44.3173455Z scale_ub_tensor = None 2025-05-07T20:32:44.3173529Z 2025-05-07T20:32:44.3173660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3173750Z op = silu_mul_quant 2025-05-07T20:32:44.3173834Z if compiled: 2025-05-07T20:32:44.3173937Z op = torch.compile(op) 2025-05-07T20:32:44.3174041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3174122Z 2025-05-07T20:32:44.3174212Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3174329Z 2025-05-07T20:32:44.3174427Z moe/activation_test.py:117: 2025-05-07T20:32:44.3174556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3174737Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3174836Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3175205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3175296Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3175787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3175888Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3176243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3176466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3176806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3176902Z kernel = self.compile( 2025-05-07T20:32:44.3177280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3177461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3177586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3177591Z 2025-05-07T20:32:44.3177793Z self = 2025-05-07T20:32:44.3178569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3179073Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5bea2f7880>} 2025-05-07T20:32:44.3179816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3180013Z context = 2025-05-07T20:32:44.3180018Z 2025-05-07T20:32:44.3180184Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3180445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3180550Z module_map=module_map) 2025-05-07T20:32:44.3180715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3180814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3180900Z E ^ 2025-05-07T20:32:44.3181256Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3181261Z 2025-05-07T20:32:44.3181670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3181678Z 2025-05-07T20:32:44.3181785Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3182008Z self=, 2025-05-07T20:32:44.3182094Z T=2048, 2025-05-07T20:32:44.3182173Z D=7168, 2025-05-07T20:32:44.3182257Z scale_ub=1200.0, 2025-05-07T20:32:44.3182346Z contiguous=False, 2025-05-07T20:32:44.3182432Z compiled=False, 2025-05-07T20:32:44.3182507Z ) 2025-05-07T20:32:44.3182727Z self = 2025-05-07T20:32:44.3182901Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3182905Z 2025-05-07T20:32:44.3183067Z @given( 2025-05-07T20:32:44.3183192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3183292Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3183410Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3183603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3183716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3183794Z ) 2025-05-07T20:32:44.3184035Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3184127Z def test_silu_mul_quant( 2025-05-07T20:32:44.3184208Z self, 2025-05-07T20:32:44.3184286Z T: int, 2025-05-07T20:32:44.3184365Z D: int, 2025-05-07T20:32:44.3184468Z scale_ub: Optional[float], 2025-05-07T20:32:44.3184557Z contiguous: bool, 2025-05-07T20:32:44.3184644Z compiled: bool, 2025-05-07T20:32:44.3184725Z ) -> None: 2025-05-07T20:32:44.3184820Z torch.manual_seed(2025) 2025-05-07T20:32:44.3184906Z 2025-05-07T20:32:44.3185073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3185149Z 2025-05-07T20:32:44.3185242Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3185371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3185462Z x = x_sign * x_clamp 2025-05-07T20:32:44.3185546Z x0 = x[:, :D] 2025-05-07T20:32:44.3185629Z x1 = x[:, D:] 2025-05-07T20:32:44.3185703Z 2025-05-07T20:32:44.3185791Z if contiguous: 2025-05-07T20:32:44.3185882Z x0 = x0.contiguous() 2025-05-07T20:32:44.3185972Z x1 = x1.contiguous() 2025-05-07T20:32:44.3186049Z 2025-05-07T20:32:44.3186140Z if scale_ub is not None: 2025-05-07T20:32:44.3186246Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3186381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3186459Z ) 2025-05-07T20:32:44.3186544Z else: 2025-05-07T20:32:44.3186640Z scale_ub_tensor = None 2025-05-07T20:32:44.3186714Z 2025-05-07T20:32:44.3186845Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3186944Z op = silu_mul_quant 2025-05-07T20:32:44.3187030Z if compiled: 2025-05-07T20:32:44.3187133Z op = torch.compile(op) 2025-05-07T20:32:44.3187237Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3187311Z 2025-05-07T20:32:44.3187406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3187410Z 2025-05-07T20:32:44.3187507Z moe/activation_test.py:117: 2025-05-07T20:32:44.3187637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3187740Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3187840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3188342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3188442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3188796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3189025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3189361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3189461Z kernel = self.compile( 2025-05-07T20:32:44.3189837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3190008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3190136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3190141Z 2025-05-07T20:32:44.3190344Z self = 2025-05-07T20:32:44.3191210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3191782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5beab019e0>} 2025-05-07T20:32:44.3192521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3192715Z context = 2025-05-07T20:32:44.3192719Z 2025-05-07T20:32:44.3192880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3193152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3193259Z module_map=module_map) 2025-05-07T20:32:44.3193419Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3193527Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3193605Z E ^ 2025-05-07T20:32:44.3193956Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3193963Z 2025-05-07T20:32:44.3194371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3194376Z 2025-05-07T20:32:44.3194477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3194705Z self=, 2025-05-07T20:32:44.3194781Z T=1, 2025-05-07T20:32:44.3194857Z D=7168, 2025-05-07T20:32:44.3194943Z scale_ub=None, 2025-05-07T20:32:44.3195029Z contiguous=True, 2025-05-07T20:32:44.3195120Z compiled=False, 2025-05-07T20:32:44.3195195Z ) 2025-05-07T20:32:44.3195411Z self = 2025-05-07T20:32:44.3195580Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3195584Z 2025-05-07T20:32:44.3195663Z @given( 2025-05-07T20:32:44.3195838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3195943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3196056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3196173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3196290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3196362Z ) 2025-05-07T20:32:44.3196604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3196699Z def test_silu_mul_quant( 2025-05-07T20:32:44.3196775Z self, 2025-05-07T20:32:44.3196860Z T: int, 2025-05-07T20:32:44.3196937Z D: int, 2025-05-07T20:32:44.3197035Z scale_ub: Optional[float], 2025-05-07T20:32:44.3197125Z contiguous: bool, 2025-05-07T20:32:44.3197210Z compiled: bool, 2025-05-07T20:32:44.3197293Z ) -> None: 2025-05-07T20:32:44.3197393Z torch.manual_seed(2025) 2025-05-07T20:32:44.3197466Z 2025-05-07T20:32:44.3197632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3197708Z 2025-05-07T20:32:44.3197798Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3197922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3198013Z x = x_sign * x_clamp 2025-05-07T20:32:44.3198095Z x0 = x[:, :D] 2025-05-07T20:32:44.3198180Z x1 = x[:, D:] 2025-05-07T20:32:44.3198252Z 2025-05-07T20:32:44.3198333Z if contiguous: 2025-05-07T20:32:44.3198426Z x0 = x0.contiguous() 2025-05-07T20:32:44.3198513Z x1 = x1.contiguous() 2025-05-07T20:32:44.3198674Z 2025-05-07T20:32:44.3198768Z if scale_ub is not None: 2025-05-07T20:32:44.3198872Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3199004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3199188Z ) 2025-05-07T20:32:44.3199263Z else: 2025-05-07T20:32:44.3199355Z scale_ub_tensor = None 2025-05-07T20:32:44.3199429Z 2025-05-07T20:32:44.3199556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3199648Z op = silu_mul_quant 2025-05-07T20:32:44.3199731Z if compiled: 2025-05-07T20:32:44.3199829Z op = torch.compile(op) 2025-05-07T20:32:44.3199935Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3200007Z 2025-05-07T20:32:44.3200096Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3200100Z 2025-05-07T20:32:44.3200199Z moe/activation_test.py:117: 2025-05-07T20:32:44.3200331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3200430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3200530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3201027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3201125Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3201479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3201697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3202032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3202123Z kernel = self.compile( 2025-05-07T20:32:44.3202500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3202680Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3202805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3202814Z 2025-05-07T20:32:44.3203019Z self = 2025-05-07T20:32:44.3203790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3204292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5beab5e020>} 2025-05-07T20:32:44.3205036Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3205222Z context = 2025-05-07T20:32:44.3205226Z 2025-05-07T20:32:44.3205390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3205652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3205766Z module_map=module_map) 2025-05-07T20:32:44.3205926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3206025Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3206106Z E ^ 2025-05-07T20:32:44.3206460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3206464Z 2025-05-07T20:32:44.3206871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3206879Z 2025-05-07T20:32:44.3207060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3207283Z self=, 2025-05-07T20:32:44.3207364Z T=16384, 2025-05-07T20:32:44.3207574Z D=7168, 2025-05-07T20:32:44.3207658Z scale_ub=1200.0, 2025-05-07T20:32:44.3207747Z contiguous=False, 2025-05-07T20:32:44.3207830Z compiled=True, 2025-05-07T20:32:44.3207905Z ) 2025-05-07T20:32:44.3208125Z self = 2025-05-07T20:32:44.3208302Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.3208306Z 2025-05-07T20:32:44.3208383Z @given( 2025-05-07T20:32:44.3208505Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3208605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3208722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3208837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3208955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3209038Z ) 2025-05-07T20:32:44.3209279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3209378Z def test_silu_mul_quant( 2025-05-07T20:32:44.3209457Z self, 2025-05-07T20:32:44.3209533Z T: int, 2025-05-07T20:32:44.3209610Z D: int, 2025-05-07T20:32:44.3209709Z scale_ub: Optional[float], 2025-05-07T20:32:44.3209798Z contiguous: bool, 2025-05-07T20:32:44.3209885Z compiled: bool, 2025-05-07T20:32:44.3209964Z ) -> None: 2025-05-07T20:32:44.3210061Z torch.manual_seed(2025) 2025-05-07T20:32:44.3210135Z 2025-05-07T20:32:44.3210300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3210376Z 2025-05-07T20:32:44.3210471Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3210593Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3210685Z x = x_sign * x_clamp 2025-05-07T20:32:44.3210768Z x0 = x[:, :D] 2025-05-07T20:32:44.3210847Z x1 = x[:, D:] 2025-05-07T20:32:44.3210920Z 2025-05-07T20:32:44.3211005Z if contiguous: 2025-05-07T20:32:44.3211101Z x0 = x0.contiguous() 2025-05-07T20:32:44.3211187Z x1 = x1.contiguous() 2025-05-07T20:32:44.3211261Z 2025-05-07T20:32:44.3211349Z if scale_ub is not None: 2025-05-07T20:32:44.3211455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3211586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3211661Z ) 2025-05-07T20:32:44.3211742Z else: 2025-05-07T20:32:44.3211833Z scale_ub_tensor = None 2025-05-07T20:32:44.3211905Z 2025-05-07T20:32:44.3212033Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3212123Z op = silu_mul_quant 2025-05-07T20:32:44.3212207Z if compiled: 2025-05-07T20:32:44.3212310Z op = torch.compile(op) 2025-05-07T20:32:44.3212413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3212485Z 2025-05-07T20:32:44.3212576Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3212584Z 2025-05-07T20:32:44.3212678Z moe/activation_test.py:117: 2025-05-07T20:32:44.3212807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3212905Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3213000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3213367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3213457Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3213944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3214043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3214477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3214701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3215109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3215201Z kernel = self.compile( 2025-05-07T20:32:44.3215581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3215753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3215882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3215886Z 2025-05-07T20:32:44.3216086Z self = 2025-05-07T20:32:44.3216863Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3217362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05131580>} 2025-05-07T20:32:44.3218108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3218298Z context = 2025-05-07T20:32:44.3218303Z 2025-05-07T20:32:44.3218463Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3218724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3218831Z module_map=module_map) 2025-05-07T20:32:44.3218994Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3219097Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3219175Z E ^ 2025-05-07T20:32:44.3219526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3219537Z 2025-05-07T20:32:44.3219948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3219952Z 2025-05-07T20:32:44.3220055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3220283Z self=, 2025-05-07T20:32:44.3220361Z T=1, 2025-05-07T20:32:44.3220437Z D=7168, 2025-05-07T20:32:44.3220522Z scale_ub=None, 2025-05-07T20:32:44.3220609Z contiguous=False, 2025-05-07T20:32:44.3220693Z compiled=False, 2025-05-07T20:32:44.3220772Z ) 2025-05-07T20:32:44.3220991Z self = 2025-05-07T20:32:44.3221155Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3221159Z 2025-05-07T20:32:44.3221243Z @given( 2025-05-07T20:32:44.3221366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3221463Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3221580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3221695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3221809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3221883Z ) 2025-05-07T20:32:44.3222123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3222218Z def test_silu_mul_quant( 2025-05-07T20:32:44.3222294Z self, 2025-05-07T20:32:44.3222371Z T: int, 2025-05-07T20:32:44.3222451Z D: int, 2025-05-07T20:32:44.3222548Z scale_ub: Optional[float], 2025-05-07T20:32:44.3222722Z contiguous: bool, 2025-05-07T20:32:44.3222812Z compiled: bool, 2025-05-07T20:32:44.3222888Z ) -> None: 2025-05-07T20:32:44.3222982Z torch.manual_seed(2025) 2025-05-07T20:32:44.3223129Z 2025-05-07T20:32:44.3223292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3223367Z 2025-05-07T20:32:44.3223456Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3223579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3223668Z x = x_sign * x_clamp 2025-05-07T20:32:44.3223747Z x0 = x[:, :D] 2025-05-07T20:32:44.3223825Z x1 = x[:, D:] 2025-05-07T20:32:44.3223901Z 2025-05-07T20:32:44.3223981Z if contiguous: 2025-05-07T20:32:44.3224071Z x0 = x0.contiguous() 2025-05-07T20:32:44.3224161Z x1 = x1.contiguous() 2025-05-07T20:32:44.3224233Z 2025-05-07T20:32:44.3224323Z if scale_ub is not None: 2025-05-07T20:32:44.3224434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3224568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3224649Z ) 2025-05-07T20:32:44.3224724Z else: 2025-05-07T20:32:44.3224822Z scale_ub_tensor = None 2025-05-07T20:32:44.3224899Z 2025-05-07T20:32:44.3225025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3225113Z op = silu_mul_quant 2025-05-07T20:32:44.3225200Z if compiled: 2025-05-07T20:32:44.3225296Z op = torch.compile(op) 2025-05-07T20:32:44.3225399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3225473Z 2025-05-07T20:32:44.3225562Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3225566Z 2025-05-07T20:32:44.3225660Z moe/activation_test.py:117: 2025-05-07T20:32:44.3225791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3225889Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3225996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3226487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3226586Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3226943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3227160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3227500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3227592Z kernel = self.compile( 2025-05-07T20:32:44.3227968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3228141Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3228268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3228273Z 2025-05-07T20:32:44.3228472Z self = 2025-05-07T20:32:44.3229251Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3229746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05132840>} 2025-05-07T20:32:44.3230489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3230677Z context = 2025-05-07T20:32:44.3230763Z 2025-05-07T20:32:44.3230931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3231191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3231393Z module_map=module_map) 2025-05-07T20:32:44.3231556Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3231654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3231731Z E ^ 2025-05-07T20:32:44.3232084Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3232089Z 2025-05-07T20:32:44.3232497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3232502Z 2025-05-07T20:32:44.3232606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3232834Z self=, 2025-05-07T20:32:44.3232911Z T=2048, 2025-05-07T20:32:44.3232992Z D=7168, 2025-05-07T20:32:44.3233073Z scale_ub=None, 2025-05-07T20:32:44.3233158Z contiguous=False, 2025-05-07T20:32:44.3233249Z compiled=True, 2025-05-07T20:32:44.3233323Z ) 2025-05-07T20:32:44.3233540Z self = 2025-05-07T20:32:44.3233717Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3233722Z 2025-05-07T20:32:44.3233799Z @given( 2025-05-07T20:32:44.3233922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3234018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3234130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3234247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3234359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3234432Z ) 2025-05-07T20:32:44.3234681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3234772Z def test_silu_mul_quant( 2025-05-07T20:32:44.3234849Z self, 2025-05-07T20:32:44.3234933Z T: int, 2025-05-07T20:32:44.3235008Z D: int, 2025-05-07T20:32:44.3235107Z scale_ub: Optional[float], 2025-05-07T20:32:44.3235194Z contiguous: bool, 2025-05-07T20:32:44.3235277Z compiled: bool, 2025-05-07T20:32:44.3235357Z ) -> None: 2025-05-07T20:32:44.3235448Z torch.manual_seed(2025) 2025-05-07T20:32:44.3235520Z 2025-05-07T20:32:44.3235687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3235810Z 2025-05-07T20:32:44.3235900Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3236026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3236112Z x = x_sign * x_clamp 2025-05-07T20:32:44.3236192Z x0 = x[:, :D] 2025-05-07T20:32:44.3236273Z x1 = x[:, D:] 2025-05-07T20:32:44.3236349Z 2025-05-07T20:32:44.3236431Z if contiguous: 2025-05-07T20:32:44.3236527Z x0 = x0.contiguous() 2025-05-07T20:32:44.3236614Z x1 = x1.contiguous() 2025-05-07T20:32:44.3236692Z 2025-05-07T20:32:44.3236780Z if scale_ub is not None: 2025-05-07T20:32:44.3236889Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3237023Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3237097Z ) 2025-05-07T20:32:44.3237171Z else: 2025-05-07T20:32:44.3237266Z scale_ub_tensor = None 2025-05-07T20:32:44.3237339Z 2025-05-07T20:32:44.3237464Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3237554Z op = silu_mul_quant 2025-05-07T20:32:44.3237637Z if compiled: 2025-05-07T20:32:44.3237734Z op = torch.compile(op) 2025-05-07T20:32:44.3237841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3237996Z 2025-05-07T20:32:44.3238091Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3238095Z 2025-05-07T20:32:44.3238190Z moe/activation_test.py:117: 2025-05-07T20:32:44.3238317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3238498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3238597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3238962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3239057Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3239544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3239642Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3239996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3240221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3240559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3240656Z kernel = self.compile( 2025-05-07T20:32:44.3241032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3241206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3241332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3241336Z 2025-05-07T20:32:44.3241540Z self = 2025-05-07T20:32:44.3242308Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3242816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c0ea80ea0>} 2025-05-07T20:32:44.3243564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3243754Z context = 2025-05-07T20:32:44.3243758Z 2025-05-07T20:32:44.3243924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3244182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3244291Z module_map=module_map) 2025-05-07T20:32:44.3244451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3244550Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3244636Z E ^ 2025-05-07T20:32:44.3244987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3244996Z 2025-05-07T20:32:44.3245403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3245411Z 2025-05-07T20:32:44.3245511Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3245731Z self=, 2025-05-07T20:32:44.3245811Z T=4096, 2025-05-07T20:32:44.3245888Z D=7168, 2025-05-07T20:32:44.3245970Z scale_ub=None, 2025-05-07T20:32:44.3246061Z contiguous=False, 2025-05-07T20:32:44.3246144Z compiled=True, 2025-05-07T20:32:44.3246217Z ) 2025-05-07T20:32:44.3246435Z self = 2025-05-07T20:32:44.3246691Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3246696Z 2025-05-07T20:32:44.3246774Z @given( 2025-05-07T20:32:44.3246896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3246993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3247188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3247304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3247420Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3247498Z ) 2025-05-07T20:32:44.3247740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3247833Z def test_silu_mul_quant( 2025-05-07T20:32:44.3247912Z self, 2025-05-07T20:32:44.3247990Z T: int, 2025-05-07T20:32:44.3248066Z D: int, 2025-05-07T20:32:44.3248167Z scale_ub: Optional[float], 2025-05-07T20:32:44.3248255Z contiguous: bool, 2025-05-07T20:32:44.3248344Z compiled: bool, 2025-05-07T20:32:44.3248428Z ) -> None: 2025-05-07T20:32:44.3248522Z torch.manual_seed(2025) 2025-05-07T20:32:44.3248598Z 2025-05-07T20:32:44.3248763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3248844Z 2025-05-07T20:32:44.3248939Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3249062Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3249151Z x = x_sign * x_clamp 2025-05-07T20:32:44.3249233Z x0 = x[:, :D] 2025-05-07T20:32:44.3249312Z x1 = x[:, D:] 2025-05-07T20:32:44.3249386Z 2025-05-07T20:32:44.3249473Z if contiguous: 2025-05-07T20:32:44.3249563Z x0 = x0.contiguous() 2025-05-07T20:32:44.3249651Z x1 = x1.contiguous() 2025-05-07T20:32:44.3249729Z 2025-05-07T20:32:44.3249818Z if scale_ub is not None: 2025-05-07T20:32:44.3249927Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3250060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3250142Z ) 2025-05-07T20:32:44.3250223Z else: 2025-05-07T20:32:44.3250317Z scale_ub_tensor = None 2025-05-07T20:32:44.3250390Z 2025-05-07T20:32:44.3250520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3250616Z op = silu_mul_quant 2025-05-07T20:32:44.3250700Z if compiled: 2025-05-07T20:32:44.3250802Z op = torch.compile(op) 2025-05-07T20:32:44.3250907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3250981Z 2025-05-07T20:32:44.3251074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3251078Z 2025-05-07T20:32:44.3251175Z moe/activation_test.py:117: 2025-05-07T20:32:44.3251308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3251408Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3251507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3251878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3251975Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3252463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3252567Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3256812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3257059Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3257410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3257508Z kernel = self.compile( 2025-05-07T20:32:44.3257891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3258169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3258303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3258308Z 2025-05-07T20:32:44.3258516Z self = 2025-05-07T20:32:44.3259372Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3259877Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05e5b060>} 2025-05-07T20:32:44.3260620Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3260818Z context = 2025-05-07T20:32:44.3260822Z 2025-05-07T20:32:44.3260986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3261252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3261364Z module_map=module_map) 2025-05-07T20:32:44.3261526Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3261631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3261710Z E ^ 2025-05-07T20:32:44.3262063Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3262073Z 2025-05-07T20:32:44.3262483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3262487Z 2025-05-07T20:32:44.3262593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3262820Z self=, 2025-05-07T20:32:44.3262900Z T=16384, 2025-05-07T20:32:44.3262977Z D=5120, 2025-05-07T20:32:44.3263063Z scale_ub=1200.0, 2025-05-07T20:32:44.3263153Z contiguous=False, 2025-05-07T20:32:44.3263239Z compiled=False, 2025-05-07T20:32:44.3263313Z ) 2025-05-07T20:32:44.3263530Z self = 2025-05-07T20:32:44.3263715Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3263720Z 2025-05-07T20:32:44.3263796Z @given( 2025-05-07T20:32:44.3263918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3264017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3264133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3264255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3264367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3264446Z ) 2025-05-07T20:32:44.3264693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3264785Z def test_silu_mul_quant( 2025-05-07T20:32:44.3264868Z self, 2025-05-07T20:32:44.3264946Z T: int, 2025-05-07T20:32:44.3265022Z D: int, 2025-05-07T20:32:44.3265124Z scale_ub: Optional[float], 2025-05-07T20:32:44.3265213Z contiguous: bool, 2025-05-07T20:32:44.3265299Z compiled: bool, 2025-05-07T20:32:44.3265636Z ) -> None: 2025-05-07T20:32:44.3265777Z torch.manual_seed(2025) 2025-05-07T20:32:44.3265874Z 2025-05-07T20:32:44.3266047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3266121Z 2025-05-07T20:32:44.3266211Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3266339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3266427Z x = x_sign * x_clamp 2025-05-07T20:32:44.3266674Z x0 = x[:, :D] 2025-05-07T20:32:44.3266760Z x1 = x[:, D:] 2025-05-07T20:32:44.3266834Z 2025-05-07T20:32:44.3266925Z if contiguous: 2025-05-07T20:32:44.3267016Z x0 = x0.contiguous() 2025-05-07T20:32:44.3267213Z x1 = x1.contiguous() 2025-05-07T20:32:44.3267288Z 2025-05-07T20:32:44.3267378Z if scale_ub is not None: 2025-05-07T20:32:44.3267487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3267625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3267700Z ) 2025-05-07T20:32:44.3267778Z else: 2025-05-07T20:32:44.3267877Z scale_ub_tensor = None 2025-05-07T20:32:44.3267949Z 2025-05-07T20:32:44.3268076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3268169Z op = silu_mul_quant 2025-05-07T20:32:44.3268253Z if compiled: 2025-05-07T20:32:44.3268353Z op = torch.compile(op) 2025-05-07T20:32:44.3268461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3268534Z 2025-05-07T20:32:44.3268629Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3268634Z 2025-05-07T20:32:44.3268728Z moe/activation_test.py:117: 2025-05-07T20:32:44.3268861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3268964Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3269066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3269563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.3269663Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.3270020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.3270241Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.3270579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.3270671Z     kernel = self.compile(
2025-05-07T20:32:44.3271054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.3271231Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.3271362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.3271367Z 
2025-05-07T20:32:44.3271567Z self = 
2025-05-07T20:32:44.3272343Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.3272850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05f09120>}
2025-05-07T20:32:44.3273593Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.3273787Z context = 
2025-05-07T20:32:44.3273792Z 
2025-05-07T20:32:44.3273952Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.3274212Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.3274321Z                            module_map=module_map)
2025-05-07T20:32:44.3274479Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.3274578Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.3274653Z E       ^
2025-05-07T20:32:44.3275089Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.3275094Z 
2025-05-07T20:32:44.3275509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.3275587Z 
2025-05-07T20:32:44.3275688Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.3275968Z     self=,
2025-05-07T20:32:44.3276046Z     T=16384,
2025-05-07T20:32:44.3276121Z     D=5120,
2025-05-07T20:32:44.3276204Z     scale_ub=1200.0,
2025-05-07T20:32:44.3276288Z     contiguous=True,
2025-05-07T20:32:44.3276369Z     compiled=True,
2025-05-07T20:32:44.3276445Z )
2025-05-07T20:32:44.3276660Z self = 
2025-05-07T20:32:44.3276831Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:44.3276835Z 
2025-05-07T20:32:44.3276912Z     @given(
2025-05-07T20:32:44.3277033Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.3277133Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.3277245Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.3277365Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.3277479Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.3277552Z     )
2025-05-07T20:32:44.3277792Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.3277887Z     def test_silu_mul_quant(
2025-05-07T20:32:44.3277962Z         self,
2025-05-07T20:32:44.3278038Z         T: int,
2025-05-07T20:32:44.3278117Z         D: int,
2025-05-07T20:32:44.3278212Z         scale_ub: Optional[float],
2025-05-07T20:32:44.3278300Z         contiguous: bool,
2025-05-07T20:32:44.3278389Z         compiled: bool,
2025-05-07T20:32:44.3278466Z     ) -> None:
2025-05-07T20:32:44.3278564Z         torch.manual_seed(2025)
2025-05-07T20:32:44.3278636Z 
2025-05-07T20:32:44.3278804Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.3278880Z 
2025-05-07T20:32:44.3278971Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.3279094Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.3279192Z         x = x_sign * x_clamp
2025-05-07T20:32:44.3279271Z         x0 = x[:, :D]
2025-05-07T20:32:44.3279350Z         x1 = x[:, D:]
2025-05-07T20:32:44.3279424Z 
2025-05-07T20:32:44.3279505Z         if contiguous:
2025-05-07T20:32:44.3279596Z             x0 = x0.contiguous()
2025-05-07T20:32:44.3279684Z             x1 = x1.contiguous()
2025-05-07T20:32:44.3279755Z 
2025-05-07T20:32:44.3279844Z         if scale_ub is not None:
2025-05-07T20:32:44.3279952Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.3280085Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.3280163Z             )
2025-05-07T20:32:44.3280238Z         else:
2025-05-07T20:32:44.3280336Z             scale_ub_tensor = None
2025-05-07T20:32:44.3280411Z 
2025-05-07T20:32:44.3280538Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.3280628Z             op = silu_mul_quant
2025-05-07T20:32:44.3280717Z             if compiled:
2025-05-07T20:32:44.3280814Z                 op = torch.compile(op)
2025-05-07T20:32:44.3280916Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.3280991Z 
2025-05-07T20:32:44.3281079Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.3281083Z 
2025-05-07T20:32:44.3281182Z moe/activation_test.py:117: 
2025-05-07T20:32:44.3281307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.3281407Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.3281508Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.3281873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:44.3281963Z     return fn(*args, **kwargs)
2025-05-07T20:32:44.3282540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.3282638Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.3283066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:44.3283284Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.3283618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.3283715Z     kernel = self.compile(
2025-05-07T20:32:44.3284091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.3284260Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.3284388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:44.3284397Z 
2025-05-07T20:32:44.3284600Z self = 
2025-05-07T20:32:44.3285376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.3285881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5c05e960c0>}
2025-05-07T20:32:44.3286624Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.3286811Z context = 
2025-05-07T20:32:44.3286815Z 
2025-05-07T20:32:44.3286979Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.3287244Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.3287354Z                            module_map=module_map)
2025-05-07T20:32:44.3287513Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.3287616Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.3287691Z E       ^
2025-05-07T20:32:44.3288047Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.3288051Z 
2025-05-07T20:32:44.3288457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.3288461Z 
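The failure is an architecture mismatch rather than a bug in the kernel logic: fp8e4nv is Triton's name for float8_e4m3fn, whose conversions are only lowered on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner reports compute capability 8.6, where Triton offers only the fp8e4b15 and fp8e5 variants, exactly as the ValueError lists, so every example fails at kernel-compile time before any numerics run.

A minimal standalone reproduction of the call path in the traceback, assuming only that silu_mul_quant is importable from fbgemm_gpu.experimental.gen_ai.moe.activation (the module shown above) and that a CUDA device is present:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Shapes taken from one of the failing Hypothesis examples.
    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0 = x[:, :D].contiguous()
    x1 = x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # On a pre-SM-8.9 GPU this raises triton.compiler.errors.CompilationError
    # wrapping the ValueError above; on SM 8.9+ it should return the fp8
    # output tensor and its scale.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)

And a sketch of a capability gate that could skip these examples on unsupported GPUs; the helper and marker names here are illustrative, not FBGEMM's actual test plumbing:

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for gating fp8 tests on device capability.
    requires_fp8e4nv = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )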
Each of the following Hypothesis examples then failed with the identical CompilationError, raised from the same call path (moe/activation_test.py:117 -> fn -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile):
2025-05-07T20:32:44.3288564Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:44.3301562Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:44.3314391Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3327670Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:44.3340936Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3354042Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:44.3367648Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:44.3384404Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:44.3397035Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3410148Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:44.3423077Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:44.3435665Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3435669Z 2025-05-07T20:32:44.3436135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3436145Z 2025-05-07T20:32:44.3436252Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3436473Z self=, 2025-05-07T20:32:44.3436552Z T=16384, 2025-05-07T20:32:44.3436632Z D=5120, 2025-05-07T20:32:44.3436714Z scale_ub=None, 2025-05-07T20:32:44.3436800Z contiguous=False, 2025-05-07T20:32:44.3436886Z compiled=False, 2025-05-07T20:32:44.3436958Z ) 2025-05-07T20:32:44.3437174Z self = 2025-05-07T20:32:44.3437353Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3437358Z 2025-05-07T20:32:44.3437434Z @given( 2025-05-07T20:32:44.3437563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3437660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3437774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3437898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3438011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3438084Z ) 2025-05-07T20:32:44.3438331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3438424Z def test_silu_mul_quant( 2025-05-07T20:32:44.3438508Z self, 2025-05-07T20:32:44.3438586Z T: int, 2025-05-07T20:32:44.3438663Z D: int, 2025-05-07T20:32:44.3438765Z scale_ub: Optional[float], 2025-05-07T20:32:44.3438852Z contiguous: bool, 2025-05-07T20:32:44.3438937Z compiled: bool, 2025-05-07T20:32:44.3439019Z ) -> None: 2025-05-07T20:32:44.3439114Z torch.manual_seed(2025) 2025-05-07T20:32:44.3439187Z 2025-05-07T20:32:44.3439835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3439912Z 2025-05-07T20:32:44.3440003Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3440205Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3442019Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
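[Editor's note on the repeated CompilationError above: Triton's fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) is rejected here because the linux.g5.4xlarge runner's A10G GPU reports compute capability (8, 6), and Triton only lowers fp8e4nv on sm_89+ parts; older GPUs are limited to 'fp8e4b15' and 'fp8e5', exactly as the ValueError says. A minimal sketch of a capability guard a test could use to skip these cases — the helper name is hypothetical and not part of FBGEMM or Triton:]

```python
# Hypothetical guard (not FBGEMM's actual code): check whether Triton can
# lower fp8e4nv (torch.float8_e4m3fn) on the current GPU before invoking
# fp8 kernels such as _fbgemm_silu_mul_quant.
import torch


def supports_fp8e4nv() -> bool:
    """True iff the active CUDA device can compile fp8e4nv Triton kernels.

    Assumption: fp8e4nv lowering needs compute capability >= (8, 9);
    the A10G on this runner reports (8, 6), which is why the log shows
    "type fp8e4nv not supported in this architecture".
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

[Under that assumption, the test could be decorated with `@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+")` so pre-Ada runners skip instead of failing every sampled example.]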
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3442025Z 2025-05-07T20:32:44.3442146Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3442150Z 2025-05-07T20:32:44.3442259Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3442489Z self=, 2025-05-07T20:32:44.3442566Z T=4096, 2025-05-07T20:32:44.3442647Z D=7168, 2025-05-07T20:32:44.3442735Z scale_ub=1200.0, 2025-05-07T20:32:44.3442818Z contiguous=True, 2025-05-07T20:32:44.3442901Z compiled=True, 2025-05-07T20:32:44.3442976Z ) 2025-05-07T20:32:44.3443189Z self = 2025-05-07T20:32:44.3443359Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3443363Z 2025-05-07T20:32:44.3443444Z @given( 2025-05-07T20:32:44.3443561Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3443660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3443776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3443891Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3444009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3444083Z ) 2025-05-07T20:32:44.3444323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3444424Z def test_silu_mul_quant( 2025-05-07T20:32:44.3444500Z self, 2025-05-07T20:32:44.3444576Z T: int, 2025-05-07T20:32:44.3444655Z D: int, 2025-05-07T20:32:44.3444752Z scale_ub: Optional[float], 2025-05-07T20:32:44.3444840Z contiguous: bool, 2025-05-07T20:32:44.3444929Z compiled: bool, 2025-05-07T20:32:44.3445006Z ) -> None: 2025-05-07T20:32:44.3445105Z torch.manual_seed(2025) 2025-05-07T20:32:44.3445178Z 2025-05-07T20:32:44.3445342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3445420Z 2025-05-07T20:32:44.3445511Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3445634Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3447437Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3447447Z 2025-05-07T20:32:44.3447565Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3447569Z 2025-05-07T20:32:44.3447673Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3447893Z self=, 2025-05-07T20:32:44.3447972Z T=16384, 2025-05-07T20:32:44.3448050Z D=7168, 2025-05-07T20:32:44.3448214Z scale_ub=None, 2025-05-07T20:32:44.3448305Z contiguous=False, 2025-05-07T20:32:44.3448388Z compiled=False, 2025-05-07T20:32:44.3448466Z ) 2025-05-07T20:32:44.3448683Z self = 2025-05-07T20:32:44.3448956Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3448960Z 2025-05-07T20:32:44.3449038Z @given( 2025-05-07T20:32:44.3449164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3449262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3449374Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3449492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3449602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3449684Z ) 2025-05-07T20:32:44.3449926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3450018Z def test_silu_mul_quant( 2025-05-07T20:32:44.3450103Z self, 2025-05-07T20:32:44.3450180Z T: int, 2025-05-07T20:32:44.3450257Z D: int, 2025-05-07T20:32:44.3450358Z scale_ub: Optional[float], 2025-05-07T20:32:44.3450452Z contiguous: bool, 2025-05-07T20:32:44.3450536Z compiled: bool, 2025-05-07T20:32:44.3450616Z ) -> None: 2025-05-07T20:32:44.3450710Z torch.manual_seed(2025) 2025-05-07T20:32:44.3450783Z 2025-05-07T20:32:44.3450950Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3452752Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3452763Z 2025-05-07T20:32:44.3452877Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3452885Z 2025-05-07T20:32:44.3452986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3453211Z self=, 2025-05-07T20:32:44.3453290Z T=2048, 2025-05-07T20:32:44.3453365Z D=7168, 2025-05-07T20:32:44.3453450Z scale_ub=1200.0, 2025-05-07T20:32:44.3453534Z contiguous=True, 2025-05-07T20:32:44.3453619Z compiled=True, 2025-05-07T20:32:44.3453694Z ) 2025-05-07T20:32:44.3453909Z self = 2025-05-07T20:32:44.3454078Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3454086Z 2025-05-07T20:32:44.3454162Z @given( 2025-05-07T20:32:44.3454284Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3454384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3454496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3454616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3454730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3454802Z ) 2025-05-07T20:32:44.3455042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3455139Z def test_silu_mul_quant( 2025-05-07T20:32:44.3455214Z self, 2025-05-07T20:32:44.3455295Z T: int, 2025-05-07T20:32:44.3455370Z D: int, 2025-05-07T20:32:44.3455465Z scale_ub: Optional[float], 2025-05-07T20:32:44.3455553Z contiguous: bool, 2025-05-07T20:32:44.3455636Z compiled: bool, 2025-05-07T20:32:44.3455711Z ) -> None: 2025-05-07T20:32:44.3455806Z torch.manual_seed(2025) 2025-05-07T20:32:44.3455878Z 2025-05-07T20:32:44.3456123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3456198Z 2025-05-07T20:32:44.3456288Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3456414Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3458268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3458274Z 2025-05-07T20:32:44.3458387Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3458392Z 2025-05-07T20:32:44.3458501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3458719Z self=, 2025-05-07T20:32:44.3458798Z T=2048, 2025-05-07T20:32:44.3458882Z D=7168, 2025-05-07T20:32:44.3458963Z scale_ub=None, 2025-05-07T20:32:44.3459049Z contiguous=True, 2025-05-07T20:32:44.3459130Z compiled=False, 2025-05-07T20:32:44.3459201Z ) 2025-05-07T20:32:44.3459417Z self = 2025-05-07T20:32:44.3459585Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3459590Z 2025-05-07T20:32:44.3459666Z @given( 2025-05-07T20:32:44.3459787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3459882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3459996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3460109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3460224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3460302Z ) 2025-05-07T20:32:44.3460539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3460636Z def test_silu_mul_quant( 2025-05-07T20:32:44.3460712Z self, 2025-05-07T20:32:44.3460786Z T: int, 2025-05-07T20:32:44.3460860Z D: int, 2025-05-07T20:32:44.3460957Z scale_ub: Optional[float], 2025-05-07T20:32:44.3461043Z contiguous: bool, 2025-05-07T20:32:44.3461126Z compiled: bool, 2025-05-07T20:32:44.3461204Z ) -> None: 2025-05-07T20:32:44.3461297Z torch.manual_seed(2025) 2025-05-07T20:32:44.3461375Z 2025-05-07T20:32:44.3461538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3461610Z 2025-05-07T20:32:44.3461703Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3463489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
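[Editor's note: the OutOfMemoryError cases above look like a secondary effect — each failing example leaves its CUDA allocations cached, so free memory shrinks from 144.44 MiB to 32.44 MiB and eventually even a 40 MiB request fails. A hedged sketch of the mitigation the error message itself suggests, plus an explicit cache flush between examples; the helper is illustrative, not from activation_test.py:]

```python
# Illustrative only: reduce allocator fragmentation and return cached blocks
# between property-based examples. The env var must be set before the first
# CUDA allocation in the process for it to take effect.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import gc
import torch


def release_cuda_memory() -> None:
    # Drop dangling Python references first, then hand cached blocks back to
    # the driver so the next example starts from a clean allocator state.
    gc.collect()
    torch.cuda.empty_cache()
```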
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3463500Z 2025-05-07T20:32:44.3463621Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3463626Z 2025-05-07T20:32:44.3463725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3463943Z self=, 2025-05-07T20:32:44.3464025Z T=1, 2025-05-07T20:32:44.3464099Z D=7168, 2025-05-07T20:32:44.3464178Z scale_ub=1200.0, 2025-05-07T20:32:44.3464264Z contiguous=True, 2025-05-07T20:32:44.3464345Z compiled=False, 2025-05-07T20:32:44.3464508Z ) 2025-05-07T20:32:44.3464725Z self = 2025-05-07T20:32:44.3464885Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3464963Z 2025-05-07T20:32:44.3465040Z @given( 2025-05-07T20:32:44.3465155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3465251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3465719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3465892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3466048Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3466163Z ) 2025-05-07T20:32:44.3466424Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3466519Z def test_silu_mul_quant( 2025-05-07T20:32:44.3466595Z self, 2025-05-07T20:32:44.3466670Z T: int, 2025-05-07T20:32:44.3466749Z D: int, 2025-05-07T20:32:44.3466852Z scale_ub: Optional[float], 2025-05-07T20:32:44.3466939Z contiguous: bool, 2025-05-07T20:32:44.3467028Z compiled: bool, 2025-05-07T20:32:44.3467105Z ) -> None: 2025-05-07T20:32:44.3467203Z torch.manual_seed(2025) 2025-05-07T20:32:44.3467280Z 2025-05-07T20:32:44.3467444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3467517Z 2025-05-07T20:32:44.3467615Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3467736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3467828Z x = x_sign * x_clamp 2025-05-07T20:32:44.3467907Z x0 = x[:, :D] 2025-05-07T20:32:44.3467984Z x1 = x[:, D:] 2025-05-07T20:32:44.3468059Z 2025-05-07T20:32:44.3468140Z if contiguous: 2025-05-07T20:32:44.3468229Z x0 = x0.contiguous() 2025-05-07T20:32:44.3468319Z x1 = x1.contiguous() 2025-05-07T20:32:44.3468390Z 2025-05-07T20:32:44.3468484Z if scale_ub is not None: 2025-05-07T20:32:44.3468593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3468724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3468803Z ) 2025-05-07T20:32:44.3468880Z else: 2025-05-07T20:32:44.3468972Z scale_ub_tensor = None 2025-05-07T20:32:44.3469043Z 2025-05-07T20:32:44.3469171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3469260Z op = silu_mul_quant 2025-05-07T20:32:44.3469345Z if compiled: 2025-05-07T20:32:44.3469443Z op = torch.compile(op) 2025-05-07T20:32:44.3469546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3469621Z 2025-05-07T20:32:44.3469709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3469714Z 2025-05-07T20:32:44.3469810Z moe/activation_test.py:117: 2025-05-07T20:32:44.3469939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3470042Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3470143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3470645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3470745Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3471104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3471322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3471658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3471754Z kernel = self.compile( 2025-05-07T20:32:44.3472133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3472545Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3472675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3472680Z 2025-05-07T20:32:44.3472881Z self = 2025-05-07T20:32:44.3473767Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3474267Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b260f40>} 2025-05-07T20:32:44.3475011Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3475203Z context = 2025-05-07T20:32:44.3475208Z 2025-05-07T20:32:44.3475370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3475639Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3475804Z module_map=module_map) 2025-05-07T20:32:44.3475966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3476063Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3476140Z E ^ 2025-05-07T20:32:44.3476497Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3476501Z 2025-05-07T20:32:44.3476911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3476916Z 2025-05-07T20:32:44.3477019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3477244Z self=, 2025-05-07T20:32:44.3477320Z T=128, 2025-05-07T20:32:44.3477398Z D=5120, 2025-05-07T20:32:44.3477482Z scale_ub=None, 2025-05-07T20:32:44.3477564Z contiguous=True, 2025-05-07T20:32:44.3477651Z compiled=False, 2025-05-07T20:32:44.3477724Z ) 2025-05-07T20:32:44.3477940Z self = 2025-05-07T20:32:44.3478110Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3478114Z 2025-05-07T20:32:44.3478190Z @given( 2025-05-07T20:32:44.3478309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3478407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3478520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3478637Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3478753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3478826Z ) 2025-05-07T20:32:44.3479068Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3479165Z def test_silu_mul_quant( 2025-05-07T20:32:44.3479244Z self, 2025-05-07T20:32:44.3479327Z T: int, 2025-05-07T20:32:44.3479401Z D: int, 2025-05-07T20:32:44.3479499Z scale_ub: Optional[float], 2025-05-07T20:32:44.3479590Z contiguous: bool, 2025-05-07T20:32:44.3479674Z compiled: bool, 2025-05-07T20:32:44.3479750Z ) -> None: 2025-05-07T20:32:44.3479845Z torch.manual_seed(2025) 2025-05-07T20:32:44.3479917Z 2025-05-07T20:32:44.3480087Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3480159Z 2025-05-07T20:32:44.3480249Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3480373Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3480459Z x = x_sign * x_clamp 2025-05-07T20:32:44.3480621Z x0 = x[:, :D] 2025-05-07T20:32:44.3480706Z x1 = x[:, D:] 2025-05-07T20:32:44.3480778Z 2025-05-07T20:32:44.3480861Z if contiguous: 2025-05-07T20:32:44.3480954Z x0 = x0.contiguous() 2025-05-07T20:32:44.3481146Z x1 = x1.contiguous() 2025-05-07T20:32:44.3481217Z 2025-05-07T20:32:44.3481308Z if scale_ub is not None: 2025-05-07T20:32:44.3481412Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3481547Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3481621Z ) 2025-05-07T20:32:44.3481696Z else: 2025-05-07T20:32:44.3481792Z scale_ub_tensor = None 2025-05-07T20:32:44.3481863Z 2025-05-07T20:32:44.3481989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3482082Z op = silu_mul_quant 2025-05-07T20:32:44.3482164Z if compiled: 2025-05-07T20:32:44.3482263Z op = torch.compile(op) 2025-05-07T20:32:44.3482374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3482445Z 2025-05-07T20:32:44.3482534Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3482538Z 2025-05-07T20:32:44.3482639Z moe/activation_test.py:117: 2025-05-07T20:32:44.3482772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3482871Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3482968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3483461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3483559Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3483913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3484130Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3484469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3484562Z kernel = self.compile( 2025-05-07T20:32:44.3484940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3485117Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3485242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3485247Z 2025-05-07T20:32:44.3485450Z self = 2025-05-07T20:32:44.3486221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3486725Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262020>} 2025-05-07T20:32:44.3487463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3487655Z context = 2025-05-07T20:32:44.3487662Z 2025-05-07T20:32:44.3487825Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3488086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3488196Z module_map=module_map) 2025-05-07T20:32:44.3488358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3488458Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3488537Z E ^ 2025-05-07T20:32:44.3488971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3488976Z 2025-05-07T20:32:44.3489391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3489468Z 2025-05-07T20:32:44.3489569Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3489790Z self=, 2025-05-07T20:32:44.3489868Z T=128, 2025-05-07T20:32:44.3489943Z D=7168, 2025-05-07T20:32:44.3490022Z scale_ub=None, 2025-05-07T20:32:44.3490109Z contiguous=True, 2025-05-07T20:32:44.3490190Z compiled=False, 2025-05-07T20:32:44.3490261Z ) 2025-05-07T20:32:44.3490480Z self = 2025-05-07T20:32:44.3490647Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3490651Z 2025-05-07T20:32:44.3490734Z @given( 2025-05-07T20:32:44.3490857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3490957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3491074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3491192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3491304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3491380Z ) 2025-05-07T20:32:44.3491619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3491712Z def test_silu_mul_quant( 2025-05-07T20:32:44.3491787Z self, 2025-05-07T20:32:44.3491862Z T: int, 2025-05-07T20:32:44.3491940Z D: int, 2025-05-07T20:32:44.3492037Z scale_ub: Optional[float], 2025-05-07T20:32:44.3492124Z contiguous: bool, 2025-05-07T20:32:44.3492212Z compiled: bool, 2025-05-07T20:32:44.3492289Z ) -> None: 2025-05-07T20:32:44.3492385Z torch.manual_seed(2025) 2025-05-07T20:32:44.3492459Z 2025-05-07T20:32:44.3492629Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3492702Z 2025-05-07T20:32:44.3492797Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3492919Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3493011Z x = x_sign * x_clamp 2025-05-07T20:32:44.3493093Z x0 = x[:, :D] 2025-05-07T20:32:44.3493172Z x1 = x[:, D:] 2025-05-07T20:32:44.3493246Z 2025-05-07T20:32:44.3493329Z if contiguous: 2025-05-07T20:32:44.3493418Z x0 = x0.contiguous() 2025-05-07T20:32:44.3493508Z x1 = x1.contiguous() 2025-05-07T20:32:44.3493579Z 2025-05-07T20:32:44.3493668Z if scale_ub is not None: 2025-05-07T20:32:44.3493775Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3493905Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3493977Z ) 2025-05-07T20:32:44.3494054Z else: 2025-05-07T20:32:44.3494150Z scale_ub_tensor = None 2025-05-07T20:32:44.3494223Z 2025-05-07T20:32:44.3494353Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3494442Z op = silu_mul_quant 2025-05-07T20:32:44.3494534Z if compiled: 2025-05-07T20:32:44.3494630Z op = torch.compile(op) 2025-05-07T20:32:44.3494733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3494808Z 2025-05-07T20:32:44.3494898Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3494902Z 2025-05-07T20:32:44.3494998Z moe/activation_test.py:117: 2025-05-07T20:32:44.3495128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3495225Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3495322Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3495820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3495998Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3496358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3496581Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3496996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3497092Z kernel = self.compile( 2025-05-07T20:32:44.3497469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3497642Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3497767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3497771Z 2025-05-07T20:32:44.3497973Z self = 2025-05-07T20:32:44.3498752Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3499256Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b262f20>} 2025-05-07T20:32:44.3499999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3500187Z context = 2025-05-07T20:32:44.3500191Z 2025-05-07T20:32:44.3500353Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3500618Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3500729Z module_map=module_map) 2025-05-07T20:32:44.3500895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3500995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3501077Z E ^ 2025-05-07T20:32:44.3501440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3501444Z 2025-05-07T20:32:44.3501849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3501854Z 2025-05-07T20:32:44.3501958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3502185Z self=, 2025-05-07T20:32:44.3502264Z T=2048, 2025-05-07T20:32:44.3506228Z D=7168, 2025-05-07T20:32:44.3506329Z scale_ub=1200.0, 2025-05-07T20:32:44.3506419Z contiguous=True, 2025-05-07T20:32:44.3506508Z compiled=False, 2025-05-07T20:32:44.3506590Z ) 2025-05-07T20:32:44.3506817Z self = 2025-05-07T20:32:44.3506992Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3507001Z 2025-05-07T20:32:44.3507079Z @given( 2025-05-07T20:32:44.3507203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3507301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3507414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3507534Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3507646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3507721Z ) 2025-05-07T20:32:44.3507969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3508062Z def test_silu_mul_quant( 2025-05-07T20:32:44.3508143Z self, 2025-05-07T20:32:44.3508224Z T: int, 2025-05-07T20:32:44.3508403Z D: int, 2025-05-07T20:32:44.3508509Z scale_ub: Optional[float], 2025-05-07T20:32:44.3508597Z contiguous: bool, 2025-05-07T20:32:44.3508683Z compiled: bool, 2025-05-07T20:32:44.3508844Z ) -> None: 2025-05-07T20:32:44.3508940Z torch.manual_seed(2025) 2025-05-07T20:32:44.3509015Z 2025-05-07T20:32:44.3509189Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3510988Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
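[Editor's note: both failure modes now repeat for every sampled (T, D, scale_ub, contiguous, compiled) combination without adding information. One way to keep Hypothesis from even attempting shapes that cannot fit on the 22 GiB card, sketched under the assumption that the test keeps its current @given strategies — the memory check is new, not in the original test:]

```python
# Sketch: discard parameter combinations whose input tensor alone cannot fit
# in the currently free device memory. hypothesis.assume() makes Hypothesis
# drop the example instead of recording a failure.
import torch
from hypothesis import assume


def assume_input_fits(T: int, D: int) -> None:
    bytes_needed = T * 2 * D * 2          # [T, 2*D] bfloat16: 2 bytes/element
    free_bytes, _total = torch.cuda.mem_get_info()
    assume(bytes_needed < free_bytes // 4)  # leave headroom for temporaries
```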
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3510994Z 2025-05-07T20:32:44.3511116Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3511120Z 2025-05-07T20:32:44.3511225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3511454Z self=, 2025-05-07T20:32:44.3511532Z T=1, 2025-05-07T20:32:44.3511612Z D=5120, 2025-05-07T20:32:44.3511696Z scale_ub=1200.0, 2025-05-07T20:32:44.3511780Z contiguous=True, 2025-05-07T20:32:44.3511865Z compiled=False, 2025-05-07T20:32:44.3511939Z ) 2025-05-07T20:32:44.3512155Z self = 2025-05-07T20:32:44.3512325Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3512330Z 2025-05-07T20:32:44.3512407Z @given( 2025-05-07T20:32:44.3512525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3512631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3512750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3512871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3512982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3513064Z ) 2025-05-07T20:32:44.3513310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3513405Z def test_silu_mul_quant( 2025-05-07T20:32:44.3513478Z self, 2025-05-07T20:32:44.3513555Z T: int, 2025-05-07T20:32:44.3513630Z D: int, 2025-05-07T20:32:44.3513727Z scale_ub: Optional[float], 2025-05-07T20:32:44.3513819Z contiguous: bool, 2025-05-07T20:32:44.3513905Z compiled: bool, 2025-05-07T20:32:44.3513985Z ) -> None: 2025-05-07T20:32:44.3514082Z torch.manual_seed(2025) 2025-05-07T20:32:44.3514153Z 2025-05-07T20:32:44.3514323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3514397Z 2025-05-07T20:32:44.3514495Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3514619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3514707Z x = x_sign * x_clamp 2025-05-07T20:32:44.3514790Z x0 = x[:, :D] 2025-05-07T20:32:44.3514876Z x1 = x[:, D:] 2025-05-07T20:32:44.3514948Z 2025-05-07T20:32:44.3515032Z if contiguous: 2025-05-07T20:32:44.3515126Z x0 = x0.contiguous() 2025-05-07T20:32:44.3515215Z x1 = x1.contiguous() 2025-05-07T20:32:44.3515286Z 2025-05-07T20:32:44.3515376Z if scale_ub is not None: 2025-05-07T20:32:44.3515481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3515619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3515692Z ) 2025-05-07T20:32:44.3515823Z else: 2025-05-07T20:32:44.3515922Z scale_ub_tensor = None 2025-05-07T20:32:44.3515998Z 2025-05-07T20:32:44.3516124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3516327Z op = silu_mul_quant 2025-05-07T20:32:44.3516414Z if compiled: 2025-05-07T20:32:44.3516513Z op = torch.compile(op) 2025-05-07T20:32:44.3516702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3516775Z 2025-05-07T20:32:44.3516864Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3516874Z 2025-05-07T20:32:44.3516969Z moe/activation_test.py:117: 2025-05-07T20:32:44.3517099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3517201Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3517299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3517799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3517898Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3518260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3518481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3518819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3518918Z kernel = self.compile( 2025-05-07T20:32:44.3519299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3519470Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3519595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3519599Z 2025-05-07T20:32:44.3519810Z self = 2025-05-07T20:32:44.3520591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3521094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b1004a0>} 2025-05-07T20:32:44.3521840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3522034Z context = 2025-05-07T20:32:44.3522038Z 2025-05-07T20:32:44.3522199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3522461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3522571Z module_map=module_map) 2025-05-07T20:32:44.3522736Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3522834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3522912Z E ^ 2025-05-07T20:32:44.3523268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3523277Z 2025-05-07T20:32:44.3523687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3523692Z 2025-05-07T20:32:44.3523794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3524012Z self=, 2025-05-07T20:32:44.3524090Z T=2048, 2025-05-07T20:32:44.3524166Z D=5120, 2025-05-07T20:32:44.3524246Z scale_ub=None, 2025-05-07T20:32:44.3524332Z contiguous=True, 2025-05-07T20:32:44.3524413Z compiled=False, 2025-05-07T20:32:44.3524487Z ) 2025-05-07T20:32:44.3524702Z self = 2025-05-07T20:32:44.3524953Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3524959Z 2025-05-07T20:32:44.3525039Z @given( 2025-05-07T20:32:44.3525157Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3525329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3525446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3525562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3525673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3525750Z ) 2025-05-07T20:32:44.3525991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3526086Z def test_silu_mul_quant( 2025-05-07T20:32:44.3526160Z self, 2025-05-07T20:32:44.3526236Z T: int, 2025-05-07T20:32:44.3526317Z D: int, 2025-05-07T20:32:44.3526415Z scale_ub: Optional[float], 2025-05-07T20:32:44.3526503Z contiguous: bool, 2025-05-07T20:32:44.3526599Z compiled: bool, 2025-05-07T20:32:44.3526678Z ) -> None: 2025-05-07T20:32:44.3526774Z torch.manual_seed(2025) 2025-05-07T20:32:44.3526850Z 2025-05-07T20:32:44.3527019Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3527092Z 2025-05-07T20:32:44.3527187Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3528975Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3528981Z 2025-05-07T20:32:44.3529107Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3529112Z 2025-05-07T20:32:44.3529212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3529436Z self=, 2025-05-07T20:32:44.3529519Z T=16384, 2025-05-07T20:32:44.3529596Z D=5120, 2025-05-07T20:32:44.3529683Z scale_ub=None, 2025-05-07T20:32:44.3529765Z contiguous=True, 2025-05-07T20:32:44.3529846Z compiled=False, 2025-05-07T20:32:44.3529921Z ) 2025-05-07T20:32:44.3530136Z self = 2025-05-07T20:32:44.3530306Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3530314Z 2025-05-07T20:32:44.3530389Z @given( 2025-05-07T20:32:44.3530506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3530607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3530723Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3530837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3530955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3531032Z ) 2025-05-07T20:32:44.3531270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3531365Z def test_silu_mul_quant( 2025-05-07T20:32:44.3531442Z self, 2025-05-07T20:32:44.3531518Z T: int, 2025-05-07T20:32:44.3531597Z D: int, 2025-05-07T20:32:44.3531692Z scale_ub: Optional[float], 2025-05-07T20:32:44.3531780Z contiguous: bool, 2025-05-07T20:32:44.3531865Z compiled: bool, 2025-05-07T20:32:44.3531942Z ) -> None: 2025-05-07T20:32:44.3532038Z torch.manual_seed(2025) 2025-05-07T20:32:44.3532109Z 2025-05-07T20:32:44.3532273Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3534142Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3534221Z 2025-05-07T20:32:44.3534343Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3534347Z 2025-05-07T20:32:44.3534450Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3534671Z self=, 2025-05-07T20:32:44.3534748Z T=4096, 2025-05-07T20:32:44.3534826Z D=5120, 2025-05-07T20:32:44.3534907Z scale_ub=None, 2025-05-07T20:32:44.3534992Z contiguous=True, 2025-05-07T20:32:44.3535080Z compiled=False, 2025-05-07T20:32:44.3535152Z ) 2025-05-07T20:32:44.3535369Z self = 2025-05-07T20:32:44.3535537Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3535549Z 2025-05-07T20:32:44.3535626Z @given( 2025-05-07T20:32:44.3535750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3535846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3535957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3536073Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3536186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3536264Z ) 2025-05-07T20:32:44.3536504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3536594Z def test_silu_mul_quant( 2025-05-07T20:32:44.3536675Z self, 2025-05-07T20:32:44.3536756Z T: int, 2025-05-07T20:32:44.3536830Z D: int, 2025-05-07T20:32:44.3536933Z scale_ub: Optional[float], 2025-05-07T20:32:44.3537023Z contiguous: bool, 2025-05-07T20:32:44.3537105Z compiled: bool, 2025-05-07T20:32:44.3537191Z ) -> None: 2025-05-07T20:32:44.3537284Z torch.manual_seed(2025) 2025-05-07T20:32:44.3537361Z 2025-05-07T20:32:44.3537527Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3539303Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3539309Z 2025-05-07T20:32:44.3539426Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3539430Z 2025-05-07T20:32:44.3539529Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3539754Z self=, 2025-05-07T20:32:44.3539828Z T=2048, 2025-05-07T20:32:44.3539902Z D=5120, 2025-05-07T20:32:44.3539984Z scale_ub=None, 2025-05-07T20:32:44.3540069Z contiguous=False, 2025-05-07T20:32:44.3540149Z compiled=False, 2025-05-07T20:32:44.3540224Z ) 2025-05-07T20:32:44.3540436Z self = 2025-05-07T20:32:44.3540604Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3540612Z 2025-05-07T20:32:44.3540687Z @given( 2025-05-07T20:32:44.3540802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3540985Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3541101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3541215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3541402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3541476Z ) 2025-05-07T20:32:44.3541714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3541810Z def test_silu_mul_quant( 2025-05-07T20:32:44.3541886Z self, 2025-05-07T20:32:44.3541960Z T: int, 2025-05-07T20:32:44.3542037Z D: int, 2025-05-07T20:32:44.3542133Z scale_ub: Optional[float], 2025-05-07T20:32:44.3542225Z contiguous: bool, 2025-05-07T20:32:44.3542308Z compiled: bool, 2025-05-07T20:32:44.3542384Z ) -> None: 2025-05-07T20:32:44.3542482Z torch.manual_seed(2025) 2025-05-07T20:32:44.3542553Z 2025-05-07T20:32:44.3542718Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3544496Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3544507Z 2025-05-07T20:32:44.3544621Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3544625Z 2025-05-07T20:32:44.3544727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3544945Z self=, 2025-05-07T20:32:44.3545020Z T=4096, 2025-05-07T20:32:44.3545102Z D=7168, 2025-05-07T20:32:44.3545185Z scale_ub=None, 2025-05-07T20:32:44.3545273Z contiguous=True, 2025-05-07T20:32:44.3545356Z compiled=True, 2025-05-07T20:32:44.3545426Z ) 2025-05-07T20:32:44.3545644Z self = 2025-05-07T20:32:44.3545814Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3545818Z 2025-05-07T20:32:44.3545894Z @given( 2025-05-07T20:32:44.3546015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3546115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3546227Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3546345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3546454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3546531Z ) 2025-05-07T20:32:44.3546768Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3546859Z def test_silu_mul_quant( 2025-05-07T20:32:44.3546943Z self, 2025-05-07T20:32:44.3547020Z T: int, 2025-05-07T20:32:44.3547096Z D: int, 2025-05-07T20:32:44.3547196Z scale_ub: Optional[float], 2025-05-07T20:32:44.3547288Z contiguous: bool, 2025-05-07T20:32:44.3547370Z compiled: bool, 2025-05-07T20:32:44.3547447Z ) -> None: 2025-05-07T20:32:44.3547547Z torch.manual_seed(2025) 2025-05-07T20:32:44.3547619Z 2025-05-07T20:32:44.3547781Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3549676Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3549682Z 2025-05-07T20:32:44.3549796Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3549872Z 2025-05-07T20:32:44.3549981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3550199Z self=, 2025-05-07T20:32:44.3550276Z T=2048, 2025-05-07T20:32:44.3550350Z D=5120, 2025-05-07T20:32:44.3550430Z scale_ub=1200.0, 2025-05-07T20:32:44.3550517Z contiguous=False, 2025-05-07T20:32:44.3550599Z compiled=False, 2025-05-07T20:32:44.3550670Z ) 2025-05-07T20:32:44.3550885Z self = 2025-05-07T20:32:44.3551058Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3551062Z 2025-05-07T20:32:44.3551137Z @given( 2025-05-07T20:32:44.3551265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3551361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3551478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3551598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3551709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3551783Z ) 2025-05-07T20:32:44.3552024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3552115Z def test_silu_mul_quant( 2025-05-07T20:32:44.3552196Z self, 2025-05-07T20:32:44.3552269Z T: int, 2025-05-07T20:32:44.3552345Z D: int, 2025-05-07T20:32:44.3552442Z scale_ub: Optional[float], 2025-05-07T20:32:44.3552529Z contiguous: bool, 2025-05-07T20:32:44.3552611Z compiled: bool, 2025-05-07T20:32:44.3552692Z ) -> None: 2025-05-07T20:32:44.3552785Z torch.manual_seed(2025) 2025-05-07T20:32:44.3552859Z 2025-05-07T20:32:44.3553026Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3554799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3554813Z 2025-05-07T20:32:44.3554927Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3554931Z 2025-05-07T20:32:44.3555030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3555253Z self=, 2025-05-07T20:32:44.3555329Z T=4096, 2025-05-07T20:32:44.3555408Z D=7168, 2025-05-07T20:32:44.3555494Z scale_ub=1200.0, 2025-05-07T20:32:44.3555575Z contiguous=True, 2025-05-07T20:32:44.3555657Z compiled=False, 2025-05-07T20:32:44.3555779Z ) 2025-05-07T20:32:44.3555994Z self = 2025-05-07T20:32:44.3556168Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3556172Z 2025-05-07T20:32:44.3556247Z @given( 2025-05-07T20:32:44.3556361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3556460Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3556572Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3556683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3556798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3556872Z ) 2025-05-07T20:32:44.3557199Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3557291Z def test_silu_mul_quant( 2025-05-07T20:32:44.3557367Z self, 2025-05-07T20:32:44.3557445Z T: int, 2025-05-07T20:32:44.3557520Z D: int, 2025-05-07T20:32:44.3557687Z scale_ub: Optional[float], 2025-05-07T20:32:44.3557777Z contiguous: bool, 2025-05-07T20:32:44.3557860Z compiled: bool, 2025-05-07T20:32:44.3557935Z ) -> None: 2025-05-07T20:32:44.3558032Z torch.manual_seed(2025) 2025-05-07T20:32:44.3558106Z 2025-05-07T20:32:44.3558268Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3560057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3560067Z 2025-05-07T20:32:44.3560180Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3560185Z 2025-05-07T20:32:44.3560289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3560507Z self=, 2025-05-07T20:32:44.3560586Z T=16384, 2025-05-07T20:32:44.3560660Z D=7168, 2025-05-07T20:32:44.3560739Z scale_ub=None, 2025-05-07T20:32:44.3560826Z contiguous=False, 2025-05-07T20:32:44.3560908Z compiled=True, 2025-05-07T20:32:44.3560980Z ) 2025-05-07T20:32:44.3561199Z self = 2025-05-07T20:32:44.3561371Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.3561375Z 2025-05-07T20:32:44.3561458Z @given( 2025-05-07T20:32:44.3561577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3561674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3561796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3561908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3562017Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3562094Z ) 2025-05-07T20:32:44.3562334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3562424Z def test_silu_mul_quant( 2025-05-07T20:32:44.3562503Z self, 2025-05-07T20:32:44.3562579Z T: int, 2025-05-07T20:32:44.3562652Z D: int, 2025-05-07T20:32:44.3562751Z scale_ub: Optional[float], 2025-05-07T20:32:44.3562837Z contiguous: bool, 2025-05-07T20:32:44.3562921Z compiled: bool, 2025-05-07T20:32:44.3563001Z ) -> None: 2025-05-07T20:32:44.3563102Z torch.manual_seed(2025) 2025-05-07T20:32:44.3563176Z 2025-05-07T20:32:44.3563337Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3565113Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
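[editor's note] The allocator hint in the message (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only helps if it reaches the process before the CUDA caching allocator initializes, so it is normally exported in the environment of the pytest command rather than set inside a test. A minimal sketch of the required ordering, assuming a fresh process:

    import os

    # Must be in place before torch performs its first CUDA allocation;
    # the caching allocator reads this variable once, at initialization.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.randn([2048, 2 * 5120], device="cuda", dtype=torch.bfloat16)
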
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3565126Z 2025-05-07T20:32:44.3565239Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3565243Z 2025-05-07T20:32:44.3565592Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3565954Z self=, 2025-05-07T20:32:44.3566033Z T=4096, 2025-05-07T20:32:44.3566108Z D=7168, 2025-05-07T20:32:44.3566192Z scale_ub=None, 2025-05-07T20:32:44.3566274Z contiguous=True, 2025-05-07T20:32:44.3566467Z compiled=False, 2025-05-07T20:32:44.3566543Z ) 2025-05-07T20:32:44.3566757Z self = 2025-05-07T20:32:44.3566927Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3566932Z 2025-05-07T20:32:44.3567008Z @given( 2025-05-07T20:32:44.3567123Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3567225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3567337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3567449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3567564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3567636Z ) 2025-05-07T20:32:44.3567888Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3567979Z def test_silu_mul_quant( 2025-05-07T20:32:44.3568053Z self, 2025-05-07T20:32:44.3568136Z T: int, 2025-05-07T20:32:44.3568210Z D: int, 2025-05-07T20:32:44.3568306Z scale_ub: Optional[float], 2025-05-07T20:32:44.3568395Z contiguous: bool, 2025-05-07T20:32:44.3568478Z compiled: bool, 2025-05-07T20:32:44.3568555Z ) -> None: 2025-05-07T20:32:44.3568651Z torch.manual_seed(2025) 2025-05-07T20:32:44.3568723Z 2025-05-07T20:32:44.3568886Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3570671Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3570681Z 2025-05-07T20:32:44.3570794Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3570801Z 2025-05-07T20:32:44.3570900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3571119Z self=, 2025-05-07T20:32:44.3571200Z T=16384, 2025-05-07T20:32:44.3571274Z D=7168, 2025-05-07T20:32:44.3571353Z scale_ub=None, 2025-05-07T20:32:44.3571440Z contiguous=True, 2025-05-07T20:32:44.3571522Z compiled=False, 2025-05-07T20:32:44.3571595Z ) 2025-05-07T20:32:44.3571810Z self = 2025-05-07T20:32:44.3571985Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3571990Z 2025-05-07T20:32:44.3572069Z @given( 2025-05-07T20:32:44.3572187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3572286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3572402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3572517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3572627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3572706Z ) 2025-05-07T20:32:44.3572945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3573037Z def test_silu_mul_quant( 2025-05-07T20:32:44.3573117Z self, 2025-05-07T20:32:44.3573192Z T: int, 2025-05-07T20:32:44.3573266Z D: int, 2025-05-07T20:32:44.3573366Z scale_ub: Optional[float], 2025-05-07T20:32:44.3573452Z contiguous: bool, 2025-05-07T20:32:44.3573540Z compiled: bool, 2025-05-07T20:32:44.3573697Z ) -> None: 2025-05-07T20:32:44.3573792Z torch.manual_seed(2025) 2025-05-07T20:32:44.3573867Z 2025-05-07T20:32:44.3574030Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3575903Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
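[editor's note] The breakdown these messages report ("21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated") maps onto counters the caching allocator exposes, so the same figures are available outside the error path. A small diagnostic sketch:

    import torch

    # memory_allocated() is the "allocated by PyTorch" figure; memory_reserved()
    # also counts cached segments that are held but not currently handed out.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"allocated: {allocated:.2f} GiB")
    print(f"reserved but unallocated: {reserved - allocated:.2f} GiB")
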
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3575912Z 2025-05-07T20:32:44.3576026Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3576030Z 2025-05-07T20:32:44.3576137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3576362Z self=, 2025-05-07T20:32:44.3576437Z T=16384, 2025-05-07T20:32:44.3576516Z D=7168, 2025-05-07T20:32:44.3576605Z scale_ub=1200.0, 2025-05-07T20:32:44.3576687Z contiguous=True, 2025-05-07T20:32:44.3576771Z compiled=False, 2025-05-07T20:32:44.3576846Z ) 2025-05-07T20:32:44.3577058Z self = 2025-05-07T20:32:44.3577233Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3577238Z 2025-05-07T20:32:44.3577314Z @given( 2025-05-07T20:32:44.3577429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3577528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3577639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3577753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3577870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3577942Z ) 2025-05-07T20:32:44.3578184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3578280Z def test_silu_mul_quant( 2025-05-07T20:32:44.3578355Z self, 2025-05-07T20:32:44.3578434Z T: int, 2025-05-07T20:32:44.3578509Z D: int, 2025-05-07T20:32:44.3578605Z scale_ub: Optional[float], 2025-05-07T20:32:44.3578696Z contiguous: bool, 2025-05-07T20:32:44.3578780Z compiled: bool, 2025-05-07T20:32:44.3578857Z ) -> None: 2025-05-07T20:32:44.3578954Z torch.manual_seed(2025) 2025-05-07T20:32:44.3579024Z 2025-05-07T20:32:44.3579200Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3581026Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
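[editor's note] On the contiguous flag these examples toggle (the full test body appears in the next example below): slicing x into halves along dim 1 produces views that share storage and are non-contiguous, and .contiguous() materializes packed copies — the layout difference the kernel is being exercised against. A CPU-only illustration:

    import torch

    # Column slices of a row-major 2-D tensor skip elements in memory, so
    # they are not contiguous until explicitly copied.
    x = torch.randn(4, 2 * 8)
    x0, x1 = x[:, :8], x[:, 8:]
    print(x0.is_contiguous(), x1.is_contiguous())  # False False
    print(x1.contiguous().is_contiguous())         # True (fresh packed copy)
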
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3581036Z 2025-05-07T20:32:44.3581150Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3581156Z 2025-05-07T20:32:44.3581256Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3581476Z self=, 2025-05-07T20:32:44.3581556Z T=128, 2025-05-07T20:32:44.3581630Z D=5120, 2025-05-07T20:32:44.3581711Z scale_ub=1200.0, 2025-05-07T20:32:44.3581797Z contiguous=False, 2025-05-07T20:32:44.3581879Z compiled=False, 2025-05-07T20:32:44.3581949Z ) 2025-05-07T20:32:44.3582247Z self = 2025-05-07T20:32:44.3582417Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3582422Z 2025-05-07T20:32:44.3582496Z @given( 2025-05-07T20:32:44.3582694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3582790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3582905Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3583018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3583128Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3583205Z ) 2025-05-07T20:32:44.3583441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3583532Z def test_silu_mul_quant( 2025-05-07T20:32:44.3583610Z self, 2025-05-07T20:32:44.3583685Z T: int, 2025-05-07T20:32:44.3583759Z D: int, 2025-05-07T20:32:44.3583858Z scale_ub: Optional[float], 2025-05-07T20:32:44.3583949Z contiguous: bool, 2025-05-07T20:32:44.3584035Z compiled: bool, 2025-05-07T20:32:44.3584111Z ) -> None: 2025-05-07T20:32:44.3584203Z torch.manual_seed(2025) 2025-05-07T20:32:44.3584278Z 2025-05-07T20:32:44.3584446Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3584521Z 2025-05-07T20:32:44.3584614Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3584737Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3584823Z x = x_sign * x_clamp 2025-05-07T20:32:44.3584907Z x0 = x[:, :D] 2025-05-07T20:32:44.3584985Z x1 = x[:, D:] 2025-05-07T20:32:44.3585057Z 2025-05-07T20:32:44.3585145Z if contiguous: 2025-05-07T20:32:44.3585236Z x0 = x0.contiguous() 2025-05-07T20:32:44.3585323Z x1 = x1.contiguous() 2025-05-07T20:32:44.3585396Z 2025-05-07T20:32:44.3585484Z if scale_ub is not None: 2025-05-07T20:32:44.3585596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3585729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3585803Z ) 2025-05-07T20:32:44.3585882Z else: 2025-05-07T20:32:44.3585981Z scale_ub_tensor = None 2025-05-07T20:32:44.3586053Z 2025-05-07T20:32:44.3586184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3586274Z op = silu_mul_quant 2025-05-07T20:32:44.3586357Z if compiled: 2025-05-07T20:32:44.3586459Z op = torch.compile(op) 2025-05-07T20:32:44.3586563Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3586634Z 2025-05-07T20:32:44.3586727Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3586731Z 2025-05-07T20:32:44.3586825Z moe/activation_test.py:117: 2025-05-07T20:32:44.3586955Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3587052Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3587153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3587655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3587754Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3588109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3588330Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3588665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3588761Z kernel = self.compile( 2025-05-07T20:32:44.3589138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3589310Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3589529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3589534Z 2025-05-07T20:32:44.3589736Z self = 2025-05-07T20:32:44.3590518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3591091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8b153060>} 2025-05-07T20:32:44.3591835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3592025Z context = 2025-05-07T20:32:44.3592029Z 2025-05-07T20:32:44.3592195Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3592458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3592569Z module_map=module_map) 2025-05-07T20:32:44.3592730Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3592829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3592905Z E ^ 2025-05-07T20:32:44.3593264Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3593269Z 2025-05-07T20:32:44.3593676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3593680Z 2025-05-07T20:32:44.3593780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3594007Z self=, 2025-05-07T20:32:44.3594087Z T=2048, 2025-05-07T20:32:44.3594164Z D=7168, 2025-05-07T20:32:44.3594246Z scale_ub=None, 2025-05-07T20:32:44.3594330Z contiguous=False, 2025-05-07T20:32:44.3594422Z compiled=False, 2025-05-07T20:32:44.3594493Z ) 2025-05-07T20:32:44.3594708Z self = 2025-05-07T20:32:44.3594884Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3594889Z 2025-05-07T20:32:44.3594965Z @given( 2025-05-07T20:32:44.3595081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3595182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3595296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3595414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3595526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3595599Z ) 2025-05-07T20:32:44.3595900Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3595992Z def test_silu_mul_quant( 2025-05-07T20:32:44.3596066Z self, 2025-05-07T20:32:44.3596145Z T: int, 2025-05-07T20:32:44.3596223Z D: int, 2025-05-07T20:32:44.3596320Z scale_ub: Optional[float], 2025-05-07T20:32:44.3596410Z contiguous: bool, 2025-05-07T20:32:44.3596494Z compiled: bool, 2025-05-07T20:32:44.3596571Z ) -> None: 2025-05-07T20:32:44.3596666Z torch.manual_seed(2025) 2025-05-07T20:32:44.3596740Z 2025-05-07T20:32:44.3596903Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3598775Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
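[editor's note] This CompilationError, unlike the OOM failures, is hardware-dependent: fp8e4nv is Triton's e4m3 float8 type, and this job runs on a g5 (NVIDIA A10G) instance whose architecture Triton rejects here. A hedged guard sketch; the (8, 9) threshold is an inference from the error, not something stated in this log:

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) lowering needs compute capability
        # 8.9 or newer; the A10G on this runner reports (8, 6), consistent with
        # the "not supported in this architecture" failure above.
        return torch.cuda.get_device_capability() >= (8, 9)

Tests exercising the e4m3 path could then be skipped on older parts, e.g. via pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv unsupported").
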
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3598852Z 2025-05-07T20:32:44.3598970Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3598979Z 2025-05-07T20:32:44.3599078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3599298Z self=, 2025-05-07T20:32:44.3599377Z T=128, 2025-05-07T20:32:44.3599453Z D=7168, 2025-05-07T20:32:44.3599534Z scale_ub=1200.0, 2025-05-07T20:32:44.3599622Z contiguous=True, 2025-05-07T20:32:44.3599702Z compiled=True, 2025-05-07T20:32:44.3599773Z ) 2025-05-07T20:32:44.3599988Z self = 2025-05-07T20:32:44.3600152Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3600162Z 2025-05-07T20:32:44.3600240Z @given( 2025-05-07T20:32:44.3600358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3600454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3600577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3600690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3600799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3600876Z ) 2025-05-07T20:32:44.3601114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3601206Z def test_silu_mul_quant( 2025-05-07T20:32:44.3601287Z self, 2025-05-07T20:32:44.3601361Z T: int, 2025-05-07T20:32:44.3601435Z D: int, 2025-05-07T20:32:44.3601533Z scale_ub: Optional[float], 2025-05-07T20:32:44.3601619Z contiguous: bool, 2025-05-07T20:32:44.3601704Z compiled: bool, 2025-05-07T20:32:44.3601782Z ) -> None: 2025-05-07T20:32:44.3601877Z torch.manual_seed(2025) 2025-05-07T20:32:44.3601954Z 2025-05-07T20:32:44.3602118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3602197Z 2025-05-07T20:32:44.3602289Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3602411Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3602497Z x = x_sign * x_clamp 2025-05-07T20:32:44.3602580Z x0 = x[:, :D] 2025-05-07T20:32:44.3602658Z x1 = x[:, D:] 2025-05-07T20:32:44.3602729Z 2025-05-07T20:32:44.3602812Z if contiguous: 2025-05-07T20:32:44.3602902Z x0 = x0.contiguous() 2025-05-07T20:32:44.3602991Z x1 = x1.contiguous() 2025-05-07T20:32:44.3603065Z 2025-05-07T20:32:44.3603155Z if scale_ub is not None: 2025-05-07T20:32:44.3603258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3603392Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3603470Z ) 2025-05-07T20:32:44.3603548Z else: 2025-05-07T20:32:44.3603641Z scale_ub_tensor = None 2025-05-07T20:32:44.3603712Z 2025-05-07T20:32:44.3603843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3603937Z op = silu_mul_quant 2025-05-07T20:32:44.3604020Z if compiled: 2025-05-07T20:32:44.3604120Z op = torch.compile(op) 2025-05-07T20:32:44.3604223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3604294Z 2025-05-07T20:32:44.3604388Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3604392Z 2025-05-07T20:32:44.3604486Z moe/activation_test.py:117: 2025-05-07T20:32:44.3604617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3604714Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3604812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3605344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.3605437Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.3605926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3606125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3606479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3606703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3607036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3607127Z kernel = self.compile( 2025-05-07T20:32:44.3607506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3607675Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3607805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3607813Z 2025-05-07T20:32:44.3608016Z self = 2025-05-07T20:32:44.3608795Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3609301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f5b8afe0900>} 2025-05-07T20:32:44.3610041Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3610238Z context = 2025-05-07T20:32:44.3610242Z 2025-05-07T20:32:44.3610403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3610664Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3610776Z module_map=module_map) 2025-05-07T20:32:44.3610935Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3611032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3611111Z E ^ 2025-05-07T20:32:44.3611462Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3611466Z 2025-05-07T20:32:44.3611874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3611879Z 2025-05-07T20:32:44.3611979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3612204Z self=, 2025-05-07T20:32:44.3612283Z T=128, 2025-05-07T20:32:44.3612357Z D=7168, 2025-05-07T20:32:44.3612441Z scale_ub=1200.0, 2025-05-07T20:32:44.3612527Z contiguous=True, 2025-05-07T20:32:44.3612608Z compiled=False, 2025-05-07T20:32:44.3612682Z ) 2025-05-07T20:32:44.3612898Z self = 2025-05-07T20:32:44.3613064Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3613068Z 2025-05-07T20:32:44.3613148Z @given( 2025-05-07T20:32:44.3613264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3613362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3613481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3613594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3613708Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3613868Z ) 2025-05-07T20:32:44.3614110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3614203Z def test_silu_mul_quant( 2025-05-07T20:32:44.3614350Z self, 2025-05-07T20:32:44.3614425Z T: int, 2025-05-07T20:32:44.3614503Z D: int, 2025-05-07T20:32:44.3614600Z scale_ub: Optional[float], 2025-05-07T20:32:44.3614687Z contiguous: bool, 2025-05-07T20:32:44.3614773Z compiled: bool, 2025-05-07T20:32:44.3614850Z ) -> None: 2025-05-07T20:32:44.3614944Z torch.manual_seed(2025) 2025-05-07T20:32:44.3615021Z 2025-05-07T20:32:44.3615185Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3615259Z 2025-05-07T20:32:44.3615350Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3615471Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3617266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
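[editor's note] By this point the process holds 22.05 GiB with only 8.44 MiB free, so even a 20 MiB intermediate (x_clamp here, and x_sign in the next example) fails: allocations from earlier examples are evidently outliving their example. One possible mitigation between Hypothesis examples (an assumption about a fix, not something the test file does):

    import gc

    import torch

    def reset_cuda_between_examples() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Note that empty_cache() only releases cached-but-unallocated memory; tensors that are still referenced, e.g. by a failing example being replayed, stay allocated.
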
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3617277Z 2025-05-07T20:32:44.3617394Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.3617398Z 2025-05-07T20:32:44.3617501Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3617720Z self=, 2025-05-07T20:32:44.3617794Z T=128, 2025-05-07T20:32:44.3617871Z D=5120, 2025-05-07T20:32:44.3617950Z scale_ub=1200.0, 2025-05-07T20:32:44.3618030Z contiguous=True, 2025-05-07T20:32:44.3618114Z compiled=True, 2025-05-07T20:32:44.3618190Z ) 2025-05-07T20:32:44.3618404Z self = 2025-05-07T20:32:44.3618572Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.3618582Z 2025-05-07T20:32:44.3618657Z @given( 2025-05-07T20:32:44.3618776Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3618872Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3618984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3619104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3619236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3619318Z ) 2025-05-07T20:32:44.3619576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3619667Z def test_silu_mul_quant( 2025-05-07T20:32:44.3619744Z self, 2025-05-07T20:32:44.3619819Z T: int, 2025-05-07T20:32:44.3619897Z D: int, 2025-05-07T20:32:44.3619996Z scale_ub: Optional[float], 2025-05-07T20:32:44.3620082Z contiguous: bool, 2025-05-07T20:32:44.3620165Z compiled: bool, 2025-05-07T20:32:44.3620248Z ) -> None: 2025-05-07T20:32:44.3620339Z torch.manual_seed(2025) 2025-05-07T20:32:44.3620410Z 2025-05-07T20:32:44.3620574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3620650Z 2025-05-07T20:32:44.3620740Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3622602Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
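[editor's note] For reference, the math under test is small: per ref_fn (which appears later in this log), silu_mul_quant fuses SiLU(x0) * x1 with row-wise fp8 quantization, and the eager float32 product is:

    import torch

    # Matches ref_fn below: y = x0 * sigmoid(x0) * x1 in float32, i.e.
    # SiLU(x0) * x1, computed before triton_quantize_fp8_row is applied.
    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        x0f, x1f = x0.float(), x1.float()
        return x0f * torch.sigmoid(x0f) * x1f
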
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3622609Z 2025-05-07T20:32:44.3622726Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3622805Z 2025-05-07T20:32:44.3622909Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3623128Z self=, 2025-05-07T20:32:44.3623206Z T=128, 2025-05-07T20:32:44.3623279Z D=7168, 2025-05-07T20:32:44.3623359Z scale_ub=None, 2025-05-07T20:32:44.3623443Z contiguous=True, 2025-05-07T20:32:44.3623523Z compiled=True, 2025-05-07T20:32:44.3623594Z ) 2025-05-07T20:32:44.3623810Z self = 2025-05-07T20:32:44.3623970Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3623975Z 2025-05-07T20:32:44.3624050Z @given( 2025-05-07T20:32:44.3624174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3624271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3624388Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3624501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3624616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3624692Z ) 2025-05-07T20:32:44.3624932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3625021Z def test_silu_mul_quant( 2025-05-07T20:32:44.3625100Z self, 2025-05-07T20:32:44.3625176Z T: int, 2025-05-07T20:32:44.3625251Z D: int, 2025-05-07T20:32:44.3625349Z scale_ub: Optional[float], 2025-05-07T20:32:44.3625437Z contiguous: bool, 2025-05-07T20:32:44.3625519Z compiled: bool, 2025-05-07T20:32:44.3625599Z ) -> None: 2025-05-07T20:32:44.3625690Z torch.manual_seed(2025) 2025-05-07T20:32:44.3625765Z 2025-05-07T20:32:44.3625938Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3627711Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3627722Z 2025-05-07T20:32:44.3627845Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3627979Z =============================== warnings summary =============================== 2025-05-07T20:32:44.3628287Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.3628589Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.3632990Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.3633898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:44.3634125Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:44.3634133Z 2025-05-07T20:32:44.3634310Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:44.3635688Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:44.3635992Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:44.3636078Z 2025-05-07T20:32:44.3636289Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:44.3636458Z ================== 1 failed, 1 passed, 13 warnings in 20.18s =================== 2025-05-07T20:32:46.0908165Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:46.1522566Z 2025-05-07T20:32:46.1523144Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:46.1523624Z 2025-05-07T20:32:46.1523630Z 2025-05-07T20:32:46.1543643Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:48.3093925Z ============================= test session starts ============================== 2025-05-07T20:32:48.3095625Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:48.3096722Z cachedir: .pytest_cache 2025-05-07T20:32:48.3097845Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:48.3099270Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:48.3100078Z plugins: hypothesis-6.131.14 2025-05-07T20:32:49.9301084Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:50.0388516Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:50.0389559Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:50.0390108Z 2025-05-07T20:32:52.1320858Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:52.1321981Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:52.1323353Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:52.1324830Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:52.1325833Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1327136Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:52.1328527Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1329520Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1330752Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:52.1332469Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1333692Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1334983Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:52.1336235Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:52.1337467Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:52.1338683Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:52.1339517Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1340551Z W0507 
20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:52.1341574Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:52.1342380Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:52.1343593Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:52.1344891Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:52.1346017Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:52.1347063Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:52.1348247Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:52.1349607Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:52.1350679Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1351603Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1352354Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:52.1353371Z W0507 20:32:52.130000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1482905Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:52.1483984Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:52.1485477Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:52.1486905Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:52.1487892Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1489357Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:52.1490748Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1491734Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1492974Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:52.1494358Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1495417Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1496711Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:52.1498111Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:52.1499411Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:52.1500625Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:52.1501444Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.1502476Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:52.1503498Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:52.1504299Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:52.1505616Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:52.1506893Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:52.1508087Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:52.1509134Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:52.1510314Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:52.1511678Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:52.1512735Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1513657Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1514400Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:52.1515422Z W0507 20:32:52.147000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5693984Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5694643Z self=, 2025-05-07T20:32:52.5695071Z T=1, 2025-05-07T20:32:52.5695269Z D=5120, 2025-05-07T20:32:52.5695462Z scale_ub=None, 2025-05-07T20:32:52.5695687Z contiguous=True, 2025-05-07T20:32:52.5695916Z compiled=True, 2025-05-07T20:32:52.5696137Z ) 2025-05-07T20:32:52.5696466Z self = 2025-05-07T20:32:52.5696960Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.5697221Z 2025-05-07T20:32:52.5697303Z @given( 2025-05-07T20:32:52.5697539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5697869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5698177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5698519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5698856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5699153Z ) 2025-05-07T20:32:52.5699509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5699972Z def test_silu_mul_quant( 2025-05-07T20:32:52.5700228Z self, 2025-05-07T20:32:52.5700432Z T: int, 2025-05-07T20:32:52.5700642Z D: int, 2025-05-07T20:32:52.5700874Z scale_ub: Optional[float], 2025-05-07T20:32:52.5701151Z contiguous: bool, 2025-05-07T20:32:52.5701402Z compiled: bool, 2025-05-07T20:32:52.5701646Z ) -> None: 2025-05-07T20:32:52.5701861Z torch.manual_seed(2025) 2025-05-07T20:32:52.5702113Z 2025-05-07T20:32:52.5702398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5702742Z 2025-05-07T20:32:52.5702946Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5703250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5703562Z x = x_sign * x_clamp 2025-05-07T20:32:52.5703817Z x0 = x[:, :D] 2025-05-07T20:32:52.5704041Z x1 = x[:, D:] 2025-05-07T20:32:52.5704248Z 2025-05-07T20:32:52.5704708Z if contiguous: 2025-05-07T20:32:52.5704954Z x0 = x0.contiguous() 2025-05-07T20:32:52.5705229Z x1 = x1.contiguous() 2025-05-07T20:32:52.5705472Z 2025-05-07T20:32:52.5705821Z if scale_ub is not None: 2025-05-07T20:32:52.5706104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5706441Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5706758Z ) 2025-05-07T20:32:52.5706956Z else: 2025-05-07T20:32:52.5707168Z scale_ub_tensor = None 2025-05-07T20:32:52.5707426Z 2025-05-07T20:32:52.5707663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5707977Z op = silu_mul_quant 2025-05-07T20:32:52.5708238Z if compiled: 2025-05-07T20:32:52.5708492Z op = torch.compile(op) 2025-05-07T20:32:52.5708787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5709067Z 2025-05-07T20:32:52.5709275Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.5709561Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.5709867Z 2025-05-07T20:32:52.5710116Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5710464Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.5710760Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.5711080Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.5711448Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.5711760Z 2025-05-07T20:32:52.5711973Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.5712170Z 2025-05-07T20:32:52.5712282Z moe/activation_test.py:126: 2025-05-07T20:32:52.5712582Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5712928Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.5713262Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.5714063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.5714815Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.5715373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5716135Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5716829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.5717552Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.5718290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.5718935Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.5719539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.5720061Z fn() 2025-05-07T20:32:52.5720575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.5721164Z self.fn.run( 2025-05-07T20:32:52.5721628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5722163Z kernel = self.compile( 2025-05-07T20:32:52.5722704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5723352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5723787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5724047Z 2025-05-07T20:32:52.5724254Z self = 2025-05-07T20:32:52.5725432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5726908Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91f50c20>} 2025-05-07T20:32:52.5728251Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5729279Z context = 2025-05-07T20:32:52.5729568Z 2025-05-07T20:32:52.5729742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.5730275Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.5730744Z module_map=module_map) 2025-05-07T20:32:52.5731113Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.5731485Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.5731750Z E ^ 2025-05-07T20:32:52.5732218Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5732672Z 2025-05-07T20:32:52.5733094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.5733608Z 2025-05-07T20:32:52.5733742Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5734177Z self=, 2025-05-07T20:32:52.5734597Z T=2048, 2025-05-07T20:32:52.5734790Z D=5120, 2025-05-07T20:32:52.5734982Z scale_ub=1200.0, 2025-05-07T20:32:52.5735209Z contiguous=True, 2025-05-07T20:32:52.5735440Z compiled=False, 2025-05-07T20:32:52.5735645Z ) 2025-05-07T20:32:53.0200064Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.0201175Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:53.0202523Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.0204006Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.0205008Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0206306Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.0207692Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0208679Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0210241Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.0211622Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0212832Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0214116Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.0215364Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:53.0216592Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.0217805Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:53.0218635Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.0219662Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:53.0220678Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:53.0221474Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:53.0222682Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.0224012Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.0225134Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:53.0226171Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:53.0227351Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.0228698Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.0229759Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0230671Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0231411Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:53.0232422Z W0507 20:32:53.016000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5796537Z self = 2025-05-07T20:32:53.5797109Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.5797496Z 2025-05-07T20:32:53.5797593Z @given( 2025-05-07T20:32:53.5797827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5798154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5798478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5798812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5799150Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5799445Z ) 2025-05-07T20:32:53.5799794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5800241Z def test_silu_mul_quant( 2025-05-07T20:32:53.5800491Z self, 2025-05-07T20:32:53.5800691Z T: int, 2025-05-07T20:32:53.5800895Z D: int, 2025-05-07T20:32:53.5801123Z scale_ub: Optional[float], 2025-05-07T20:32:53.5801399Z contiguous: bool, 2025-05-07T20:32:53.5801665Z compiled: bool, 2025-05-07T20:32:53.5801897Z ) -> None: 2025-05-07T20:32:53.5802121Z torch.manual_seed(2025) 2025-05-07T20:32:53.5802374Z 2025-05-07T20:32:53.5802665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5803019Z 2025-05-07T20:32:53.5803230Z x_sign = torch.sign(x) 2025-05-07T20:32:53.5803529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.5803846Z x = x_sign * x_clamp 2025-05-07T20:32:53.5804095Z x0 = x[:, :D] 2025-05-07T20:32:53.5804327Z x1 = x[:, D:] 2025-05-07T20:32:53.5804533Z 2025-05-07T20:32:53.5804770Z if contiguous: 2025-05-07T20:32:53.5805004Z x0 = x0.contiguous() 2025-05-07T20:32:53.5805279Z x1 = x1.contiguous() 2025-05-07T20:32:53.5805532Z 2025-05-07T20:32:53.5805723Z if scale_ub is not None: 2025-05-07T20:32:53.5806007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.5806350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.5806895Z ) 2025-05-07T20:32:53.5807098Z else: 2025-05-07T20:32:53.5807313Z scale_ub_tensor = None 2025-05-07T20:32:53.5807568Z 2025-05-07T20:32:53.5807813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5808280Z op = silu_mul_quant 2025-05-07T20:32:53.5808536Z if compiled: 2025-05-07T20:32:53.5808786Z op = torch.compile(op) 2025-05-07T20:32:53.5809088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5809376Z 2025-05-07T20:32:53.5809571Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.5809741Z 2025-05-07T20:32:53.5809844Z moe/activation_test.py:117: 2025-05-07T20:32:53.5810148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5810484Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.5810772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5811475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.5812178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.5812718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.5813412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.5814084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.5814615Z kernel = self.compile( 2025-05-07T20:32:53.5815163Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.5815821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5816229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5816461Z 2025-05-07T20:32:53.5816676Z self = 2025-05-07T20:32:53.5817775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.5819178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91e10180>} 2025-05-07T20:32:53.5820529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.5821567Z context = 2025-05-07T20:32:53.5821856Z 2025-05-07T20:32:53.5822028Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.5822572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5823049Z module_map=module_map) 2025-05-07T20:32:53.5823413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5823779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.5824051Z E ^ 2025-05-07T20:32:53.5824518Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5824971Z 2025-05-07T20:32:53.5825384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.5825904Z 2025-05-07T20:32:53.5826010Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5826427Z self=, 2025-05-07T20:32:53.5826841Z T=2048, 2025-05-07T20:32:53.5827030Z D=5120, 2025-05-07T20:32:53.5827228Z scale_ub=1200.0, 2025-05-07T20:32:53.5827589Z contiguous=True, 2025-05-07T20:32:53.5827815Z compiled=True, 2025-05-07T20:32:53.5828027Z ) 2025-05-07T20:32:53.5828353Z self = 2025-05-07T20:32:53.5828928Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.5829207Z 2025-05-07T20:32:53.5829287Z @given( 2025-05-07T20:32:53.5829522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5829835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5830151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5830500Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5830827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5831125Z ) 2025-05-07T20:32:53.5840356Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5840850Z def test_silu_mul_quant( 2025-05-07T20:32:53.5841107Z self, 2025-05-07T20:32:53.5841317Z T: int, 2025-05-07T20:32:53.5841529Z D: int, 2025-05-07T20:32:53.5841754Z scale_ub: Optional[float], 2025-05-07T20:32:53.5842048Z contiguous: bool, 2025-05-07T20:32:53.5842301Z compiled: bool, 2025-05-07T20:32:53.5842528Z ) -> None: 2025-05-07T20:32:53.5842760Z torch.manual_seed(2025) 2025-05-07T20:32:53.5843022Z 2025-05-07T20:32:53.5843301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5843658Z 2025-05-07T20:32:53.5843870Z x_sign = torch.sign(x) 2025-05-07T20:32:53.5844196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.5844537Z x = x_sign * x_clamp 2025-05-07T20:32:53.5844793Z x0 = x[:, :D] 
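# A minimal eager sketch of what this test exercises, mirroring the ref_fn in
# the listing above: SiLU(x0) * x1 computed in fp32, which the kernel under
# test (silu_mul_quant) fuses with row-wise FP8 quantization.
import torch

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # Upcast to fp32 so the reference does not depend on bf16 rounding.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32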
2025-05-07T20:32:53.5845019Z x1 = x[:, D:] 2025-05-07T20:32:53.5845229Z 2025-05-07T20:32:53.5845426Z if contiguous: 2025-05-07T20:32:53.5845670Z x0 = x0.contiguous() 2025-05-07T20:32:53.5845942Z x1 = x1.contiguous() 2025-05-07T20:32:53.5846195Z 2025-05-07T20:32:53.5846405Z if scale_ub is not None: 2025-05-07T20:32:53.5846686Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.5847041Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.5847367Z ) 2025-05-07T20:32:53.5847574Z else: 2025-05-07T20:32:53.5847791Z scale_ub_tensor = None 2025-05-07T20:32:53.5848067Z 2025-05-07T20:32:53.5848315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5848634Z op = silu_mul_quant 2025-05-07T20:32:53.5848900Z if compiled: 2025-05-07T20:32:53.5849165Z op = torch.compile(op) 2025-05-07T20:32:53.5849470Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5849765Z 2025-05-07T20:32:53.5850042Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.5850428Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.5850751Z 2025-05-07T20:32:53.5851003Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5851344Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.5851655Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.5851986Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.5852370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.5852693Z 2025-05-07T20:32:53.5852924Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.5853127Z 2025-05-07T20:32:53.5853242Z moe/activation_test.py:126: 2025-05-07T20:32:53.5853544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5853894Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.5854236Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.5855160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.5855934Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.5856489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.5857263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.5857954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.5858687Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.5859429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.5860079Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.5860683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.5861393Z fn() 2025-05-07T20:32:53.5862023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.5862620Z self.fn.run( 2025-05-07T20:32:53.5863104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.5863656Z kernel = self.compile( 2025-05-07T20:32:53.5864211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.5864874Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5865289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5865873Z 2025-05-07T20:32:53.5866094Z self = 2025-05-07T20:32:53.5867192Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.5868571Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b906de840>} 2025-05-07T20:32:53.5869930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.5870968Z context = 2025-05-07T20:32:53.5871260Z 2025-05-07T20:32:53.5871440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.5871970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5872453Z module_map=module_map) 2025-05-07T20:32:53.5872834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5873203Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.5873471Z E ^ 2025-05-07T20:32:53.5873952Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5874412Z 2025-05-07T20:32:53.5874841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.5875354Z 2025-05-07T20:32:53.5875469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5875947Z self=, 2025-05-07T20:32:53.5876366Z T=16384, 2025-05-07T20:32:53.5876574Z D=7168, 2025-05-07T20:32:53.5876771Z scale_ub=1200.0, 2025-05-07T20:32:53.5877013Z contiguous=False, 2025-05-07T20:32:53.5877256Z compiled=False, 2025-05-07T20:32:53.5877467Z ) 2025-05-07T20:32:53.8345963Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.8348380Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:53.8351067Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.8353930Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.8355133Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8356545Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.8357945Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.8358938Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8360176Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.8361559Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.8362636Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8363929Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.8365188Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:53.8366768Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.8367989Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:53.8368833Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.8369986Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:53.8371017Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:53.8371817Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:53.8373199Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.8374553Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.8375801Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:53.8376857Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:53.8378042Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.8379419Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.8380633Z 
W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.8381571Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.8382324Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:53.8383349Z W0507 20:32:53.831000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4036282Z self = 2025-05-07T20:32:54.4037185Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.4037662Z 2025-05-07T20:32:54.4037792Z @given( 2025-05-07T20:32:54.4038543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4039056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4039533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4040314Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4040819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4041270Z ) 2025-05-07T20:32:54.4041840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4042561Z def test_silu_mul_quant( 2025-05-07T20:32:54.4042933Z self, 2025-05-07T20:32:54.4043243Z T: int, 2025-05-07T20:32:54.4043553Z D: int, 2025-05-07T20:32:54.4043891Z scale_ub: Optional[float], 2025-05-07T20:32:54.4044335Z contiguous: bool, 2025-05-07T20:32:54.4044738Z compiled: bool, 2025-05-07T20:32:54.4045111Z ) -> None: 2025-05-07T20:32:54.4045460Z torch.manual_seed(2025) 2025-05-07T20:32:54.4045848Z 2025-05-07T20:32:54.4046290Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4046856Z 2025-05-07T20:32:54.4047174Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4047656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4048152Z x = x_sign * x_clamp 2025-05-07T20:32:54.4048548Z x0 = x[:, :D] 2025-05-07T20:32:54.4048898Z x1 = x[:, D:] 2025-05-07T20:32:54.4049221Z 2025-05-07T20:32:54.4049519Z if contiguous: 2025-05-07T20:32:54.4049888Z x0 = x0.contiguous() 2025-05-07T20:32:54.4050296Z x1 = x1.contiguous() 2025-05-07T20:32:54.4050686Z 2025-05-07T20:32:54.4050992Z if scale_ub is not None: 2025-05-07T20:32:54.4051419Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4051952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4052446Z ) 2025-05-07T20:32:54.4052742Z else: 2025-05-07T20:32:54.4053084Z scale_ub_tensor = None 2025-05-07T20:32:54.4053481Z 2025-05-07T20:32:54.4053839Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4054351Z op = silu_mul_quant 2025-05-07T20:32:54.4054770Z if compiled: 2025-05-07T20:32:54.4055173Z op = torch.compile(op) 2025-05-07T20:32:54.4055593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4055986Z 2025-05-07T20:32:54.4056270Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.4056516Z 2025-05-07T20:32:54.4056666Z moe/activation_test.py:117: 2025-05-07T20:32:54.4057111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4057636Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.4058060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4059191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.4060387Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.4061319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4062407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4063512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4064379Z kernel = self.compile( 2025-05-07T20:32:54.4065325Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4066885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.4067573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4067960Z 2025-05-07T20:32:54.4068308Z self = 2025-05-07T20:32:54.4070422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4072802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd260>} 2025-05-07T20:32:54.4075044Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4076755Z context = 2025-05-07T20:32:54.4077217Z 2025-05-07T20:32:54.4077495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4078351Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4079128Z module_map=module_map) 2025-05-07T20:32:54.4079731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4080299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.4080740Z E ^ 2025-05-07T20:32:54.4081496Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4082237Z 2025-05-07T20:32:54.4082972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4083793Z 2025-05-07T20:32:54.4083951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4084574Z self=, 2025-05-07T20:32:54.4085300Z T=1, 2025-05-07T20:32:54.4085601Z D=7168, 2025-05-07T20:32:54.4085921Z scale_ub=None, 2025-05-07T20:32:54.4086287Z contiguous=True, 2025-05-07T20:32:54.4086655Z compiled=True, 2025-05-07T20:32:54.4086998Z ) 2025-05-07T20:32:54.4087536Z self = 2025-05-07T20:32:54.4088370Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.4088826Z 2025-05-07T20:32:54.4088955Z @given( 2025-05-07T20:32:54.4089342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.4089871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.4090384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.4090949Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.4091509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.4091989Z ) 2025-05-07T20:32:54.4092589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.4093355Z def test_silu_mul_quant( 2025-05-07T20:32:54.4093759Z self, 2025-05-07T20:32:54.4094080Z T: int, 2025-05-07T20:32:54.4094418Z D: int, 2025-05-07T20:32:54.4094784Z scale_ub: Optional[float], 2025-05-07T20:32:54.4095217Z contiguous: bool, 2025-05-07T20:32:54.4095594Z compiled: bool, 2025-05-07T20:32:54.4095972Z ) -> None: 2025-05-07T20:32:54.4096308Z torch.manual_seed(2025) 2025-05-07T20:32:54.4096703Z 2025-05-07T20:32:54.4097137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.4097683Z 2025-05-07T20:32:54.4098004Z x_sign = torch.sign(x) 2025-05-07T20:32:54.4098497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.4099014Z x = x_sign * x_clamp 2025-05-07T20:32:54.4099420Z x0 = x[:, :D] 2025-05-07T20:32:54.4099779Z x1 = 
x[:, D:] 2025-05-07T20:32:54.4100118Z 2025-05-07T20:32:54.4100425Z if contiguous: 2025-05-07T20:32:54.4100813Z x0 = x0.contiguous() 2025-05-07T20:32:54.4101246Z x1 = x1.contiguous() 2025-05-07T20:32:54.4101640Z 2025-05-07T20:32:54.4102104Z if scale_ub is not None: 2025-05-07T20:32:54.4102582Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.4103135Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.4103735Z ) 2025-05-07T20:32:54.4104063Z else: 2025-05-07T20:32:54.4104408Z scale_ub_tensor = None 2025-05-07T20:32:54.4104834Z 2025-05-07T20:32:54.4105220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4105749Z op = silu_mul_quant 2025-05-07T20:32:54.4106168Z if compiled: 2025-05-07T20:32:54.4106582Z op = torch.compile(op) 2025-05-07T20:32:54.4107081Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.4107556Z 2025-05-07T20:32:54.4107876Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.4108345Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.4108848Z 2025-05-07T20:32:54.4109250Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.4109825Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.4110320Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.4110853Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.4111471Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4111998Z 2025-05-07T20:32:54.4112332Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.4112668Z 2025-05-07T20:32:54.4112841Z moe/activation_test.py:126: 2025-05-07T20:32:54.4113340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4113918Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.4114478Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.4115901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.4116922Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.4117664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.4118603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.4119534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.4120558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.4121618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.4122563Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.4123411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.4124158Z fn() 2025-05-07T20:32:54.4124959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.4125788Z self.fn.run( 2025-05-07T20:32:54.4126474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.4127278Z kernel = self.compile( 2025-05-07T20:32:54.4128111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.4129141Z module = src.make_ir(options, 
codegen_fns, module_map, context) 2025-05-07T20:32:54.4129794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.4130164Z 2025-05-07T20:32:54.4130496Z self = 2025-05-07T20:32:54.4132449Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.4134850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b67ecaac0>} 2025-05-07T20:32:54.4137194Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.4138826Z context = 2025-05-07T20:32:54.4139222Z 2025-05-07T20:32:54.4139495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.4140266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.4140963Z module_map=module_map) 2025-05-07T20:32:54.4141524Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.4142074Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.4142489Z E ^ 2025-05-07T20:32:54.4143212Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.4143924Z 2025-05-07T20:32:54.4144576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.4145382Z 2025-05-07T20:32:54.4145543Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.4146181Z self=, 2025-05-07T20:32:54.4146810Z T=4096, 2025-05-07T20:32:54.4147100Z D=5120, 2025-05-07T20:32:54.4147388Z scale_ub=None, 2025-05-07T20:32:54.4147726Z contiguous=False, 2025-05-07T20:32:54.4148071Z compiled=False, 2025-05-07T20:32:54.4148383Z ) 2025-05-07T20:32:54.8690051Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:54.8691922Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:54.8694218Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:54.8696590Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:54.8698302Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8700511Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:54.8702839Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8704528Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8706660Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:54.8709462Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8711315Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8713749Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:54.8716065Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:54.8718167Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:54.8720283Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:54.8721612Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8723343Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:54.8725097Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:54.8726445Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:54.8728420Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:54.8730515Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:54.8732292Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:54.8734004Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:54.8735932Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:54.8738207Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:54.8740028Z 
W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8741530Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8742741Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:54.8744510Z W0507 20:32:54.865000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.6993913Z self = 2025-05-07T20:32:55.6994520Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.6994963Z 2025-05-07T20:32:55.6995057Z @given( 2025-05-07T20:32:55.6995535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.6996316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.6996951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.6997639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.6998293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.6998897Z ) 2025-05-07T20:32:55.6999607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7000509Z def test_silu_mul_quant( 2025-05-07T20:32:55.7000995Z self, 2025-05-07T20:32:55.7001397Z T: int, 2025-05-07T20:32:55.7001803Z D: int, 2025-05-07T20:32:55.7002236Z scale_ub: Optional[float], 2025-05-07T20:32:55.7002781Z contiguous: bool, 2025-05-07T20:32:55.7003263Z compiled: bool, 2025-05-07T20:32:55.7003713Z ) -> None: 2025-05-07T20:32:55.7004152Z torch.manual_seed(2025) 2025-05-07T20:32:55.7004644Z 2025-05-07T20:32:55.7005144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7005509Z 2025-05-07T20:32:55.7005712Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7006004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7006330Z x = x_sign * x_clamp 2025-05-07T20:32:55.7006591Z x0 = x[:, :D] 2025-05-07T20:32:55.7006812Z x1 = x[:, D:] 2025-05-07T20:32:55.7007026Z 2025-05-07T20:32:55.7007223Z if contiguous: 2025-05-07T20:32:55.7007456Z x0 = x0.contiguous() 2025-05-07T20:32:55.7007728Z x1 = x1.contiguous() 2025-05-07T20:32:55.7007979Z 2025-05-07T20:32:55.7008177Z if scale_ub is not None: 2025-05-07T20:32:55.7008463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7008807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7009129Z ) 2025-05-07T20:32:55.7009325Z else: 2025-05-07T20:32:55.7009548Z scale_ub_tensor = None 2025-05-07T20:32:55.7009811Z 2025-05-07T20:32:55.7010391Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7010724Z op = silu_mul_quant 2025-05-07T20:32:55.7010985Z if compiled: 2025-05-07T20:32:55.7011237Z op = torch.compile(op) 2025-05-07T20:32:55.7011728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7012018Z 2025-05-07T20:32:55.7012215Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7012391Z 2025-05-07T20:32:55.7012496Z moe/activation_test.py:117: 2025-05-07T20:32:55.7012801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7013146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7013430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7014132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7014830Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.7015372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7016069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7016740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7017288Z kernel = self.compile( 2025-05-07T20:32:55.7017832Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7018496Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7018907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7019137Z 2025-05-07T20:32:55.7019344Z self = 2025-05-07T20:32:55.7020437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7022042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cdee0>} 2025-05-07T20:32:55.7023400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7024435Z context = 2025-05-07T20:32:55.7024729Z 2025-05-07T20:32:55.7024897Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7025430Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7025910Z module_map=module_map) 2025-05-07T20:32:55.7026285Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7026642Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.7026914Z E ^ 2025-05-07T20:32:55.7027389Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7027847Z 2025-05-07T20:32:55.7028263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7028781Z 2025-05-07T20:32:55.7028888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7029311Z self=, 2025-05-07T20:32:55.7029729Z T=4096, 2025-05-07T20:32:55.7029924Z D=7168, 2025-05-07T20:32:55.7030128Z scale_ub=None, 2025-05-07T20:32:55.7030355Z contiguous=False, 2025-05-07T20:32:55.7030583Z compiled=False, 2025-05-07T20:32:55.7030803Z ) 2025-05-07T20:32:55.7031233Z self = 2025-05-07T20:32:55.7031736Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.7032024Z 2025-05-07T20:32:55.7032106Z @given( 2025-05-07T20:32:55.7032425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7032748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7033057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7033398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7033734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7034022Z ) 2025-05-07T20:32:55.7034381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7034998Z def test_silu_mul_quant( 2025-05-07T20:32:55.7035252Z self, 2025-05-07T20:32:55.7035456Z T: int, 2025-05-07T20:32:55.7035665Z D: int, 2025-05-07T20:32:55.7035948Z scale_ub: Optional[float], 2025-05-07T20:32:55.7036235Z contiguous: bool, 2025-05-07T20:32:55.7036486Z compiled: bool, 2025-05-07T20:32:55.7036714Z ) -> None: 2025-05-07T20:32:55.7036940Z torch.manual_seed(2025) 2025-05-07T20:32:55.7037197Z 2025-05-07T20:32:55.7037470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7037823Z 2025-05-07T20:32:55.7038028Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7038330Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7038647Z x = x_sign * x_clamp 2025-05-07T20:32:55.7038895Z x0 = x[:, :D] 
2025-05-07T20:32:55.7039112Z x1 = x[:, D:] 2025-05-07T20:32:55.7039327Z 2025-05-07T20:32:55.7039522Z if contiguous: 2025-05-07T20:32:55.7039755Z x0 = x0.contiguous() 2025-05-07T20:32:55.7040021Z x1 = x1.contiguous() 2025-05-07T20:32:55.7040268Z 2025-05-07T20:32:55.7040463Z if scale_ub is not None: 2025-05-07T20:32:55.7040749Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7041087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7041399Z ) 2025-05-07T20:32:55.7041595Z else: 2025-05-07T20:32:55.7041819Z scale_ub_tensor = None 2025-05-07T20:32:55.7042076Z 2025-05-07T20:32:55.7042308Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7042630Z op = silu_mul_quant 2025-05-07T20:32:55.7042887Z if compiled: 2025-05-07T20:32:55.7043133Z op = torch.compile(op) 2025-05-07T20:32:55.7043434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7043712Z 2025-05-07T20:32:55.7043910Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.7044081Z 2025-05-07T20:32:55.7044185Z moe/activation_test.py:117: 2025-05-07T20:32:55.7044488Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7044821Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.7045116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7045806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.7046500Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.7047045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7047731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7048398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7048929Z kernel = self.compile( 2025-05-07T20:32:55.7049471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7050134Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7050646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7050881Z 2025-05-07T20:32:55.7051089Z self = 2025-05-07T20:32:55.7052177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7053637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd940>} 2025-05-07T20:32:55.7054991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7056027Z context = 2025-05-07T20:32:55.7056320Z 2025-05-07T20:32:55.7056500Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7057035Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7057531Z module_map=module_map) 2025-05-07T20:32:55.7057896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7058261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.7058535Z E ^ 2025-05-07T20:32:55.7059009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7059461Z 2025-05-07T20:32:55.7059877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7060395Z 2025-05-07T20:32:55.7060503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7060929Z self=, 2025-05-07T20:32:55.7061345Z T=128, 2025-05-07T20:32:55.7061543Z D=7168, 2025-05-07T20:32:55.7061755Z scale_ub=None, 2025-05-07T20:32:55.7061990Z contiguous=False, 2025-05-07T20:32:55.7062231Z compiled=True, 2025-05-07T20:32:55.7062449Z ) 2025-05-07T20:32:55.7619263Z self = 2025-05-07T20:32:55.7620432Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:55.7621115Z 2025-05-07T20:32:55.7621285Z @given( 2025-05-07T20:32:55.7621758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.7622387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.7623015Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.7623690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.7624401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.7624988Z ) 2025-05-07T20:32:55.7625509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.7625960Z def test_silu_mul_quant( 2025-05-07T20:32:55.7626217Z self, 2025-05-07T20:32:55.7626419Z T: int, 2025-05-07T20:32:55.7626638Z D: int, 2025-05-07T20:32:55.7626866Z scale_ub: Optional[float], 2025-05-07T20:32:55.7627143Z contiguous: bool, 2025-05-07T20:32:55.7627398Z compiled: bool, 2025-05-07T20:32:55.7627637Z ) -> None: 2025-05-07T20:32:55.7627857Z torch.manual_seed(2025) 2025-05-07T20:32:55.7628112Z 2025-05-07T20:32:55.7628403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.7628748Z 2025-05-07T20:32:55.7628951Z x_sign = torch.sign(x) 2025-05-07T20:32:55.7629252Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.7629565Z x = x_sign * x_clamp 2025-05-07T20:32:55.7629815Z x0 = x[:, :D] 2025-05-07T20:32:55.7630046Z x1 = x[:, D:] 2025-05-07T20:32:55.7630546Z 2025-05-07T20:32:55.7630738Z if contiguous: 2025-05-07T20:32:55.7630978Z x0 = x0.contiguous() 2025-05-07T20:32:55.7631248Z x1 = x1.contiguous() 2025-05-07T20:32:55.7631617Z 2025-05-07T20:32:55.7631818Z if scale_ub is not None: 2025-05-07T20:32:55.7632100Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.7632438Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.7632761Z ) 2025-05-07T20:32:55.7632963Z else: 2025-05-07T20:32:55.7633179Z scale_ub_tensor = None 2025-05-07T20:32:55.7633441Z 2025-05-07T20:32:55.7633687Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7634006Z op = silu_mul_quant 2025-05-07T20:32:55.7634266Z if compiled: 2025-05-07T20:32:55.7634525Z op = torch.compile(op) 2025-05-07T20:32:55.7634826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.7635119Z 2025-05-07T20:32:55.7635329Z y_fp8, y_scale = fn() 2025-05-07T20:32:55.7635625Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:55.7636033Z 2025-05-07T20:32:55.7636283Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.7636635Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:55.7636935Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:55.7637258Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:55.7637629Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.7637943Z 2025-05-07T20:32:55.7638158Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:55.7638356Z 2025-05-07T20:32:55.7638464Z moe/activation_test.py:126: 2025-05-07T20:32:55.7638768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7639118Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:55.7639461Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:55.7640254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:55.7641014Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:55.7641573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.7642268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.7642968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:55.7643691Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:55.7644431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:55.7645079Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:55.7645690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:55.7646219Z fn() 2025-05-07T20:32:55.7646734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:55.7647330Z self.fn.run( 2025-05-07T20:32:55.7647800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.7648341Z kernel = self.compile( 2025-05-07T20:32:55.7648893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.7649548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.7649955Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.7650195Z 2025-05-07T20:32:55.7650534Z self = 2025-05-07T20:32:55.7651634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.7653108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90281620>} 2025-05-07T20:32:55.7654452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.7655490Z context = 2025-05-07T20:32:55.7655790Z 2025-05-07T20:32:55.7655964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.7656503Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.7656980Z module_map=module_map) 2025-05-07T20:32:55.7657361Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.7657746Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:55.7658020Z E ^ 2025-05-07T20:32:55.7658495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.7658955Z 2025-05-07T20:32:55.7659372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.7659886Z 2025-05-07T20:32:55.7660003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.7660423Z self=, 2025-05-07T20:32:55.7660840Z T=128, 2025-05-07T20:32:55.7661045Z D=7168, 2025-05-07T20:32:55.7661249Z scale_ub=None, 2025-05-07T20:32:55.7661489Z contiguous=False, 2025-05-07T20:32:55.7661725Z compiled=False, 2025-05-07T20:32:55.7661936Z ) 2025-05-07T20:32:55.9637822Z self = 2025-05-07T20:32:55.9638555Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:55.9638865Z 2025-05-07T20:32:55.9638951Z @given( 2025-05-07T20:32:55.9639192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9639515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9639825Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9640162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9640502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9640794Z ) 2025-05-07T20:32:55.9641153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9641605Z def test_silu_mul_quant( 2025-05-07T20:32:55.9641866Z self, 2025-05-07T20:32:55.9642071Z T: int, 2025-05-07T20:32:55.9642279Z D: int, 2025-05-07T20:32:55.9642498Z scale_ub: Optional[float], 2025-05-07T20:32:55.9642788Z contiguous: bool, 2025-05-07T20:32:55.9643050Z compiled: bool, 2025-05-07T20:32:55.9643282Z ) -> None: 2025-05-07T20:32:55.9643512Z torch.manual_seed(2025) 2025-05-07T20:32:55.9643765Z 2025-05-07T20:32:55.9644039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9644397Z 2025-05-07T20:32:55.9644599Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9644902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9645212Z x = x_sign * x_clamp 2025-05-07T20:32:55.9645492Z x0 = x[:, :D] 2025-05-07T20:32:55.9645749Z x1 = x[:, D:] 2025-05-07T20:32:55.9645959Z 2025-05-07T20:32:55.9646156Z if contiguous: 2025-05-07T20:32:55.9646399Z x0 = x0.contiguous() 2025-05-07T20:32:55.9647018Z x1 = x1.contiguous() 2025-05-07T20:32:55.9647270Z 2025-05-07T20:32:55.9647474Z if scale_ub is not None: 2025-05-07T20:32:55.9647756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9648240Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9648565Z ) 2025-05-07T20:32:55.9648763Z else: 2025-05-07T20:32:55.9648984Z scale_ub_tensor = None 2025-05-07T20:32:55.9649243Z 2025-05-07T20:32:55.9649480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9649805Z op = silu_mul_quant 2025-05-07T20:32:55.9650068Z if compiled: 2025-05-07T20:32:55.9650320Z op = torch.compile(op) 2025-05-07T20:32:55.9650617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9650902Z 2025-05-07T20:32:55.9651108Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.9651274Z 2025-05-07T20:32:55.9651378Z moe/activation_test.py:117: 2025-05-07T20:32:55.9651692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9652032Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.9652315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9653020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.9653720Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.9654265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9654949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9655672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9656219Z kernel = self.compile( 2025-05-07T20:32:55.9656766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9657428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9657835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9658075Z 2025-05-07T20:32:55.9658290Z self = 2025-05-07T20:32:55.9659376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9660859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90282160>} 2025-05-07T20:32:55.9662297Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9663339Z context = 2025-05-07T20:32:55.9663629Z 2025-05-07T20:32:55.9663811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9664338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9664822Z module_map=module_map) 2025-05-07T20:32:55.9665195Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9665906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.9666173Z E ^ 2025-05-07T20:32:55.9666639Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9667090Z 2025-05-07T20:32:55.9667514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9668209Z 2025-05-07T20:32:55.9668321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9668740Z self=, 2025-05-07T20:32:55.9669263Z T=4096, 2025-05-07T20:32:55.9669456Z D=5120, 2025-05-07T20:32:55.9669660Z scale_ub=1200.0, 2025-05-07T20:32:55.9669888Z contiguous=True, 2025-05-07T20:32:55.9670111Z compiled=False, 2025-05-07T20:32:55.9670326Z ) 2025-05-07T20:32:55.9670654Z self = 2025-05-07T20:32:55.9671155Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:55.9671431Z 2025-05-07T20:32:55.9671516Z @given( 2025-05-07T20:32:55.9671757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:55.9672077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:55.9672385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:55.9680607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:55.9680966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:55.9681260Z ) 2025-05-07T20:32:55.9681623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:55.9682087Z def test_silu_mul_quant( 2025-05-07T20:32:55.9682335Z self, 2025-05-07T20:32:55.9682549Z T: int, 2025-05-07T20:32:55.9682757Z D: int, 2025-05-07T20:32:55.9682978Z scale_ub: Optional[float], 2025-05-07T20:32:55.9683260Z contiguous: bool, 2025-05-07T20:32:55.9683512Z compiled: bool, 2025-05-07T20:32:55.9683741Z ) -> None: 2025-05-07T20:32:55.9683971Z torch.manual_seed(2025) 2025-05-07T20:32:55.9684225Z 2025-05-07T20:32:55.9684507Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:55.9684868Z 2025-05-07T20:32:55.9685064Z x_sign = torch.sign(x) 2025-05-07T20:32:55.9685367Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:55.9685691Z x = x_sign * x_clamp 2025-05-07T20:32:55.9685941Z x0 = x[:, :D] 2025-05-07T20:32:55.9686161Z x1 = x[:, D:] 2025-05-07T20:32:55.9686380Z 2025-05-07T20:32:55.9686585Z if contiguous: 2025-05-07T20:32:55.9686824Z x0 = x0.contiguous() 2025-05-07T20:32:55.9687103Z x1 = x1.contiguous() 2025-05-07T20:32:55.9687356Z 2025-05-07T20:32:55.9687552Z if scale_ub is not None: 2025-05-07T20:32:55.9687834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:55.9688179Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:55.9688492Z ) 2025-05-07T20:32:55.9688697Z else: 2025-05-07T20:32:55.9688910Z scale_ub_tensor = None 2025-05-07T20:32:55.9689166Z 2025-05-07T20:32:55.9689406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:55.9689737Z op = silu_mul_quant 2025-05-07T20:32:55.9689990Z if compiled: 2025-05-07T20:32:55.9690253Z op = torch.compile(op) 2025-05-07T20:32:55.9690560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9690846Z 2025-05-07T20:32:55.9691047Z > y_fp8, y_scale = fn() 2025-05-07T20:32:55.9691228Z 2025-05-07T20:32:55.9691335Z moe/activation_test.py:117: 2025-05-07T20:32:55.9691641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9691978Z moe/activation_test.py:115: in fn 2025-05-07T20:32:55.9692270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:55.9692979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:55.9693666Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:55.9694209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:55.9695012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:55.9695694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:55.9696227Z kernel = self.compile( 2025-05-07T20:32:55.9696850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:55.9697514Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:55.9697922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:55.9698153Z 2025-05-07T20:32:55.9698362Z self = 2025-05-07T20:32:55.9699449Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:55.9700837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9204efc0>} 2025-05-07T20:32:55.9702182Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:55.9703212Z context = 2025-05-07T20:32:55.9703510Z 2025-05-07T20:32:55.9703679Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:55.9704214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:55.9704689Z module_map=module_map) 2025-05-07T20:32:55.9705054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:55.9705415Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:55.9705684Z E ^ 2025-05-07T20:32:55.9706152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:55.9706616Z 2025-05-07T20:32:55.9707037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:55.9707556Z 2025-05-07T20:32:55.9707662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:55.9708087Z self=, 2025-05-07T20:32:55.9708496Z T=1, 2025-05-07T20:32:55.9708691Z D=5120, 2025-05-07T20:32:55.9708898Z scale_ub=None, 2025-05-07T20:32:55.9709117Z contiguous=True, 2025-05-07T20:32:55.9709351Z compiled=True, 2025-05-07T20:32:55.9709569Z ) 2025-05-07T20:32:56.2101028Z W0507 20:32:56.206000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the [0/3] occurrence above ...] 2025-05-07T20:32:56.2136604Z W0507 20:32:56.206000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.2801506Z W0507 20:32:56.277000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the one above ...] 2025-05-07T20:32:56.2834235Z W0507 20:32:56.277000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.5799886Z self = 2025-05-07T20:32:56.5800619Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:56.5800967Z 2025-05-07T20:32:56.5801052Z @given( 2025-05-07T20:32:56.5801296Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:56.5801614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:56.5801929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:56.5802293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:56.5802629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:56.5802925Z ) 2025-05-07T20:32:56.5803298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:56.5803746Z def test_silu_mul_quant( 2025-05-07T20:32:56.5803989Z self, 2025-05-07T20:32:56.5804194Z T: int, 2025-05-07T20:32:56.5804399Z D: int, 2025-05-07T20:32:56.5804616Z scale_ub: Optional[float], 2025-05-07T20:32:56.5804903Z contiguous: bool, 2025-05-07T20:32:56.5805150Z compiled: bool, 2025-05-07T20:32:56.5805381Z ) -> None: 2025-05-07T20:32:56.5805602Z torch.manual_seed(2025) 2025-05-07T20:32:56.5805850Z 2025-05-07T20:32:56.5806123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:56.5806470Z 2025-05-07T20:32:56.5806668Z x_sign = torch.sign(x) 2025-05-07T20:32:56.5807269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:56.5807593Z x = x_sign * x_clamp 2025-05-07T20:32:56.5807836Z x0 = x[:, :D] 2025-05-07T20:32:56.5808055Z x1 = x[:, D:] 2025-05-07T20:32:56.5808438Z 2025-05-07T20:32:56.5808628Z if contiguous: 2025-05-07T20:32:56.5808867Z x0 = x0.contiguous() 2025-05-07T20:32:56.5809123Z x1 = x1.contiguous() 2025-05-07T20:32:56.5809367Z 2025-05-07T20:32:56.5809566Z if scale_ub is not None: 2025-05-07T20:32:56.5809839Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:56.5810182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:56.5810496Z ) 2025-05-07T20:32:56.5810686Z else: 2025-05-07T20:32:56.5810903Z scale_ub_tensor = None 2025-05-07T20:32:56.5811165Z 2025-05-07T20:32:56.5811398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.5811718Z op = silu_mul_quant 2025-05-07T20:32:56.5811983Z if compiled: 2025-05-07T20:32:56.5812233Z op = torch.compile(op) 2025-05-07T20:32:56.5812534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:56.5812820Z 2025-05-07T20:32:56.5813019Z y_fp8, y_scale = fn() 2025-05-07T20:32:56.5813306Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:56.5813606Z 2025-05-07T20:32:56.5813850Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:56.5814184Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:56.5814481Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:56.5814801Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:56.5815157Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.5815475Z 2025-05-07T20:32:56.5815693Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:56.5815920Z 2025-05-07T20:32:56.5816031Z moe/activation_test.py:126: 2025-05-07T20:32:56.5816337Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.5816681Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:56.5817011Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:56.5817802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:32:56.5818560Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:56.5819107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:56.5819795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:56.5820478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:56.5821201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:56.5821943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:56.5822581Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:56.5823186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:56.5823709Z fn() 2025-05-07T20:32:56.5824219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:56.5824794Z self.fn.run( 2025-05-07T20:32:56.5825265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:56.5825802Z kernel = self.compile( 2025-05-07T20:32:56.5826335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:56.5827160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:56.5827569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:56.5827800Z 2025-05-07T20:32:56.5828014Z self = 2025-05-07T20:32:56.5829171Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:56.5830561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b901cdee0>} 2025-05-07T20:32:56.5831903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:56.5832936Z context = 2025-05-07T20:32:56.5833224Z 2025-05-07T20:32:56.5833399Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:56.5833919Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:56.5834401Z module_map=module_map) 2025-05-07T20:32:56.5834769Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:56.5835125Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:56.5835399Z E ^ 2025-05-07T20:32:56.5836036Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:56.5836505Z 2025-05-07T20:32:56.5836923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:56.5837434Z 2025-05-07T20:32:56.5837540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:56.5837962Z self=, 2025-05-07T20:32:56.5838376Z T=2048, 2025-05-07T20:32:56.5838569Z D=5120, 2025-05-07T20:32:56.5838770Z scale_ub=None, 2025-05-07T20:32:56.5839002Z contiguous=True, 2025-05-07T20:32:56.5839225Z compiled=True, 2025-05-07T20:32:56.5839442Z ) 2025-05-07T20:32:56.8092553Z W0507 20:32:56.806000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the one above ...] 2025-05-07T20:32:56.8125352Z W0507 20:32:56.806000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.8797760Z W0507 20:32:56.876000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated [... CompilationError traceback identical to the one above ...] 2025-05-07T20:32:56.8830778Z W0507 20:32:56.876000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.1833531Z self = 2025-05-07T20:32:57.1834611Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:57.1835199Z 2025-05-07T20:32:57.1835364Z @given( 2025-05-07T20:32:57.1835911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.1836308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.1836660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.1836997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.1837330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.1837617Z ) 2025-05-07T20:32:57.1837973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.1838431Z def test_silu_mul_quant( 2025-05-07T20:32:57.1838678Z self, 2025-05-07T20:32:57.1838888Z T: int, 2025-05-07T20:32:57.1839107Z D: int, 2025-05-07T20:32:57.1839328Z scale_ub: Optional[float], 2025-05-07T20:32:57.1839608Z contiguous: bool, 2025-05-07T20:32:57.1839866Z compiled: bool, 2025-05-07T20:32:57.1840101Z ) -> None: 2025-05-07T20:32:57.1840331Z torch.manual_seed(2025) 2025-05-07T20:32:57.1840587Z 2025-05-07T20:32:57.1840864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.1841226Z 2025-05-07T20:32:57.1841428Z x_sign = torch.sign(x) 2025-05-07T20:32:57.1841727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.1842042Z x = x_sign * x_clamp 2025-05-07T20:32:57.1842294Z x0 = x[:, :D] 2025-05-07T20:32:57.1842515Z x1 = x[:, D:] 2025-05-07T20:32:57.1842723Z 2025-05-07T20:32:57.1842914Z if contiguous: 2025-05-07T20:32:57.1843150Z x0 = x0.contiguous() 2025-05-07T20:32:57.1843412Z x1 = x1.contiguous() 2025-05-07T20:32:57.1843664Z 2025-05-07T20:32:57.1844197Z if scale_ub is not None: 2025-05-07T20:32:57.1844476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.1844821Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.1845284Z ) 2025-05-07T20:32:57.1845484Z else: 2025-05-07T20:32:57.1845702Z scale_ub_tensor = None 2025-05-07T20:32:57.1845964Z 2025-05-07T20:32:57.1846194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.1846515Z op = silu_mul_quant 2025-05-07T20:32:57.1846775Z if compiled: 2025-05-07T20:32:57.1847027Z op = torch.compile(op) 2025-05-07T20:32:57.1847328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.1847610Z 2025-05-07T20:32:57.1847808Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.1848092Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.1848390Z 2025-05-07T20:32:57.1848641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.1848980Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.1849279Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.1849601Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.1849966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.1850286Z 2025-05-07T20:32:57.1850496Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:57.1850694Z 2025-05-07T20:32:57.1850808Z moe/activation_test.py:126: 2025-05-07T20:32:57.1851110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.1851458Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.1851797Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.1852591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:57.1853357Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.1853929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.1862246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.1862961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.1863703Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.1864441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.1865090Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.1865979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.1866558Z fn() 2025-05-07T20:32:57.1867084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.1867667Z self.fn.run( 2025-05-07T20:32:57.1868140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.1868683Z kernel = self.compile( 2025-05-07T20:32:57.1869224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.1869875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.1870291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.1870528Z 2025-05-07T20:32:57.1870745Z self = 2025-05-07T20:32:57.1872072Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.1873471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90a64ea0>} 2025-05-07T20:32:57.1874954Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.1876094Z context = 2025-05-07T20:32:57.1876410Z 2025-05-07T20:32:57.1876581Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.1877116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.1877594Z module_map=module_map) 2025-05-07T20:32:57.1877980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.1878343Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.1878618Z E ^ 2025-05-07T20:32:57.1879095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.1879971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.1880602Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.1881028Z self=,
2025-05-07T20:32:57.1881441Z T=128,
2025-05-07T20:32:57.1881636Z D=5120,
2025-05-07T20:32:57.1881846Z scale_ub=None,
2025-05-07T20:32:57.1882070Z contiguous=True,
2025-05-07T20:32:57.1882298Z compiled=True,
2025-05-07T20:32:57.1882516Z )
2025-05-07T20:32:57.4291525Z W0507 20:32:57.426000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[two identical [0/6] warning tracebacks elided -- both end in triton.compiler.errors.CompilationError on _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:57.8430255Z self =
2025-05-07T20:32:57.8430884Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test source and failure identical to the T = 2048 report above elided: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row raises the same CompilationError]
2025-05-07T20:32:57.8467787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.8468416Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:57.8468832Z self=,
2025-05-07T20:32:57.8469240Z T=4096,
2025-05-07T20:32:57.8469436Z D=5120,
2025-05-07T20:32:57.8469635Z scale_ub=None,
2025-05-07T20:32:57.8469847Z contiguous=True,
2025-05-07T20:32:57.8470075Z compiled=True,
2025-05-07T20:32:57.8470290Z )
2025-05-07T20:32:58.0894696Z W0507 20:32:58.086000 96051 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[two identical [0/7] warning tracebacks elided -- same CompilationError as above]
2025-05-07T20:32:58.5083526Z self =
2025-05-07T20:32:58.5084298Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[identical test source and traceback elided; triton_quantize_fp8_row again fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:58.5120715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
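The "Trying example" blocks are Hypothesis's Verbosity.verbose output: each drawn parameter set is printed before it runs. While debugging, a single failing draw can be replayed deterministically by pinning it with @example next to @given. A sketch with a hypothetical test name, mirroring the strategies used above:

    from typing import Optional
    from hypothesis import Verbosity, example, given, settings
    import hypothesis.strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=2048, scale_ub=None)  # the first failing draw in this log
    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    def test_silu_mul_quant_repro(T: int, scale_ub: Optional[float]) -> None:
        ...  # body would call silu_mul_quant as in the test above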
2025-05-07T20:32:58.5121336Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.5121754Z self=,
2025-05-07T20:32:58.5122157Z T=16384,
2025-05-07T20:32:58.5122361Z D=5120,
2025-05-07T20:32:58.5122563Z scale_ub=None,
2025-05-07T20:32:58.5122783Z contiguous=True,
2025-05-07T20:32:58.5123012Z compiled=True,
2025-05-07T20:32:58.5123236Z )
2025-05-07T20:32:58.5380417Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:58.5381961Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:58.5383312Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:58.5384306Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:58.5385425Z W0507 20:32:58.536000 96051 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:58.6270602Z self =
2025-05-07T20:32:58.6278669Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[identical test source and triton_quantize_fp8_row traceback elided; same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:58.6315276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.6316000Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:58.6316420Z self=,
2025-05-07T20:32:58.6316834Z T=1,
2025-05-07T20:32:58.6317034Z D=5120,
2025-05-07T20:32:58.6317232Z scale_ub=1200.0,
2025-05-07T20:32:58.6317469Z contiguous=True,
2025-05-07T20:32:58.6317706Z compiled=True,
2025-05-07T20:32:58.6317921Z )
2025-05-07T20:32:58.7707069Z self =
2025-05-07T20:32:58.7707883Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[identical test source elided; this example fails earlier, in the compiled fn() path:]
2025-05-07T20:32:58.7720682Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:58.7720958Z moe/activation_test.py:117:
2025-05-07T20:32:58.7721256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:58.7721595Z moe/activation_test.py:115: in fn
2025-05-07T20:32:58.7721879Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:58.7722450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:58.7723008Z     return fn(*args, **kwargs)
2025-05-07T20:32:58.7723668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:58.7724361Z     _fbgemm_silu_mul_quant[grid](
[jit/compile frames and CUDAOptions dump identical to the traceback above elided]
2025-05-07T20:32:58.7735563Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:58.7735925Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:58.7736193Z E   ^
2025-05-07T20:32:58.7736665Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.7737537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
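The recompile_limit warning above is a separate issue from the fp8 failure: x0 = x[:, :D] is a view whose row stride stays 2*D, so torch.compile's guard on stride(0) flips between the contiguous and non-contiguous draws until the limit of 8 is hit and Dynamo falls back to eager. A quick illustration of the two strides named in the warning (CPU is fine; no GPU needed):

    import torch

    T, D = 1, 5120
    x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    x0 = x[:, :D]                      # view into the wide [T, 2*D] buffer
    print(x0.stride())                 # (10240, 1) -> "actual 10240"
    print(x0.contiguous().stride())    # (5120, 1)  -> "expected 5120"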
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.7737117Z 2025-05-07T20:32:58.7737537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.7738049Z 2025-05-07T20:32:58.7738244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.7738735Z self=, 2025-05-07T20:32:58.7739146Z T=1, 2025-05-07T20:32:58.7739337Z D=5120, 2025-05-07T20:32:58.7739579Z scale_ub=None, 2025-05-07T20:32:58.7739803Z contiguous=False, 2025-05-07T20:32:58.7740033Z compiled=True, 2025-05-07T20:32:58.7740243Z ) 2025-05-07T20:32:58.8364384Z self = 2025-05-07T20:32:58.8366196Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:58.8366863Z 2025-05-07T20:32:58.8366990Z @given( 2025-05-07T20:32:58.8367282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.8367608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.8367928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.8368262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.8368615Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.8368914Z ) 2025-05-07T20:32:58.8369268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.8369727Z def test_silu_mul_quant( 2025-05-07T20:32:58.8369976Z self, 2025-05-07T20:32:58.8370171Z T: int, 2025-05-07T20:32:58.8370375Z D: int, 2025-05-07T20:32:58.8370597Z scale_ub: Optional[float], 2025-05-07T20:32:58.8370875Z contiguous: bool, 2025-05-07T20:32:58.8371119Z compiled: bool, 2025-05-07T20:32:58.8371352Z ) -> None: 2025-05-07T20:32:58.8371575Z torch.manual_seed(2025) 2025-05-07T20:32:58.8371819Z 2025-05-07T20:32:58.8372100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.8372448Z 2025-05-07T20:32:58.8372648Z x_sign = torch.sign(x) 2025-05-07T20:32:58.8372944Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.8373261Z x = x_sign * x_clamp 2025-05-07T20:32:58.8373508Z x0 = x[:, :D] 2025-05-07T20:32:58.8373730Z x1 = x[:, D:] 2025-05-07T20:32:58.8373943Z 2025-05-07T20:32:58.8374131Z if contiguous: 2025-05-07T20:32:58.8374370Z x0 = x0.contiguous() 2025-05-07T20:32:58.8374635Z x1 = x1.contiguous() 2025-05-07T20:32:58.8374877Z 2025-05-07T20:32:58.8375079Z if scale_ub is not None: 2025-05-07T20:32:58.8375360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.8375696Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.8376015Z ) 2025-05-07T20:32:58.8376215Z else: 2025-05-07T20:32:58.8376431Z scale_ub_tensor = None 2025-05-07T20:32:58.8376686Z 2025-05-07T20:32:58.8376926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8377253Z op = silu_mul_quant 2025-05-07T20:32:58.8377504Z if compiled: 2025-05-07T20:32:58.8377763Z op = torch.compile(op) 2025-05-07T20:32:58.8378072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.8378349Z 2025-05-07T20:32:58.8378550Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.8378844Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.8379138Z 2025-05-07T20:32:58.8379387Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.8379733Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.8380030Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.8380351Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.8380720Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8381043Z 2025-05-07T20:32:58.8381248Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.8381455Z 2025-05-07T20:32:58.8381561Z moe/activation_test.py:126: 2025-05-07T20:32:58.8381868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8382560Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.8382897Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.8383689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.8384525Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.8385070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.8385766Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.8386461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.8387232Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.8387979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.8388626Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.8389230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.8389750Z fn() 2025-05-07T20:32:58.8390261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.8390853Z self.fn.run( 2025-05-07T20:32:58.8391318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.8391856Z kernel = self.compile( 2025-05-07T20:32:58.8392399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.8393058Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.8393458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.8393702Z 2025-05-07T20:32:58.8393911Z self = 2025-05-07T20:32:58.8395007Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.8396526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66bbb2e0>} 2025-05-07T20:32:58.8397928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.8398953Z context = 2025-05-07T20:32:58.8399248Z 2025-05-07T20:32:58.8399426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.8399964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.8400437Z module_map=module_map) 2025-05-07T20:32:58.8400812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.8401180Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.8401461Z E ^ 2025-05-07T20:32:58.8401929Z E ValueError("type fp8e4nv not supported in this architecture. 
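[Both kernels die the same way: Triton rejects the fp8e4nv (FP8 E4M3) element type while lowering the kernel AST, before anything launches on the GPU. fp8e4nv codegen requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge.nvidia.gpu runner appears to carry an A10G, which reports (8, 6), so every Hypothesis example hits the identical compile-time error. A minimal sketch of a capability guard that would skip these cases on unsupported hardware; the helper name and the decorator placement are illustrative, not FBGEMM's actual test scaffolding:]

```python
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if Triton should be able to emit fp8e4nv for the current GPU.

    Assumption: fp8e4nv (E4M3) lowering needs NVIDIA compute capability
    >= (8, 9) (Ada/Hopper); the A10G on a g5 runner reports (8, 6).
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical placement; the real class in moe/activation_test.py may differ.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class SiluMulQuantTest(unittest.TestCase):
    ...
```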
[Hypothesis tries further examples; every one fails with the identical fp8e4nv CompilationError ending at triton/compiler/compiler.py:100. The repeated source listings and tracebacks are omitted; each example and its failing call site:]
2025-05-07T20:32:58.8403436Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant via fn() (moe/activation_test.py:117)
2025-05-07T20:32:58.9960022Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.0005272Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.1195102Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.1226500Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant via fn()
2025-05-07T20:32:59.3028829Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> _fbgemm_silu_mul_quant via fn()
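[Note that the CompilationError is raised at 1:0, i.e. at the kernel's def line during make_ir: the failure is a property of the target architecture, not of the inputs, which is why T, D, scale_ub, contiguous, and compiled have no effect on the outcome. A standalone repro sketch, assuming current Triton and PyTorch fp8 APIs (tl.float8e4nv, torch.float8_e4m3fn); on a pre-SM-8.9 GPU compiling this kernel should raise the same ValueError:]

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This cast is what fp8e4nv lowering must support; on SM < 8.9 Triton
    # raises ValueError("type fp8e4nv not supported in this architecture. ...").
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)
```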
2025-05-07T20:32:59.3070688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.3071916Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.3072855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3074048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3075513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3076636Z kernel = self.compile( 2025-05-07T20:32:59.3077639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3078903Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3079591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3079991Z 2025-05-07T20:32:59.3080347Z self = 2025-05-07T20:32:59.3082258Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3084714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef4360>} 2025-05-07T20:32:59.3087116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3088943Z context = 2025-05-07T20:32:59.3089443Z 2025-05-07T20:32:59.3089730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3090629Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3091440Z module_map=module_map) 2025-05-07T20:32:59.3092056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3092646Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.3093085Z E ^ 2025-05-07T20:32:59.3093892Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3094710Z 2025-05-07T20:32:59.3095449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3096360Z 2025-05-07T20:32:59.3096537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3097256Z self=, 2025-05-07T20:32:59.3098007Z T=1, 2025-05-07T20:32:59.3098312Z D=7168, 2025-05-07T20:32:59.3098640Z scale_ub=1200.0, 2025-05-07T20:32:59.3099016Z contiguous=False, 2025-05-07T20:32:59.3099396Z compiled=True, 2025-05-07T20:32:59.3099731Z ) 2025-05-07T20:32:59.4430425Z self = 2025-05-07T20:32:59.4431294Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.4431734Z 2025-05-07T20:32:59.4431868Z @given( 2025-05-07T20:32:59.4432227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.4432772Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.4433270Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.4433800Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.4434341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.4434804Z ) 2025-05-07T20:32:59.4435375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.4436224Z def test_silu_mul_quant( 2025-05-07T20:32:59.4436613Z self, 2025-05-07T20:32:59.4436909Z T: int, 2025-05-07T20:32:59.4437204Z D: int, 2025-05-07T20:32:59.4437597Z scale_ub: Optional[float], 2025-05-07T20:32:59.4438029Z contiguous: bool, 2025-05-07T20:32:59.4438401Z compiled: bool, 2025-05-07T20:32:59.4438759Z ) -> None: 2025-05-07T20:32:59.4439100Z torch.manual_seed(2025) 2025-05-07T20:32:59.4439481Z 2025-05-07T20:32:59.4440333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.4440996Z 2025-05-07T20:32:59.4441309Z x_sign = torch.sign(x) 2025-05-07T20:32:59.4441794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.4442417Z x = x_sign * x_clamp 2025-05-07T20:32:59.4442793Z x0 = x[:, :D] 2025-05-07T20:32:59.4443134Z x1 = x[:, D:] 2025-05-07T20:32:59.4443472Z 2025-05-07T20:32:59.4443770Z if contiguous: 2025-05-07T20:32:59.4444139Z x0 = x0.contiguous() 2025-05-07T20:32:59.4444568Z x1 = x1.contiguous() 2025-05-07T20:32:59.4444956Z 2025-05-07T20:32:59.4445255Z if scale_ub is not None: 2025-05-07T20:32:59.4445703Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.4446242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.4446735Z ) 2025-05-07T20:32:59.4447039Z else: 2025-05-07T20:32:59.4447377Z scale_ub_tensor = None 2025-05-07T20:32:59.4447777Z 2025-05-07T20:32:59.4448159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.4448675Z op = silu_mul_quant 2025-05-07T20:32:59.4449073Z if compiled: 2025-05-07T20:32:59.4449469Z op = torch.compile(op) 2025-05-07T20:32:59.4449948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.4450389Z 2025-05-07T20:32:59.4450698Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.4450974Z 2025-05-07T20:32:59.4451136Z moe/activation_test.py:117: 2025-05-07T20:32:59.4451615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.4452129Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.4452579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.4453504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.4454427Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.4455348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.4456375Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.4457216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.4458318Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.4459372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.4460211Z kernel = self.compile( 2025-05-07T20:32:59.4461123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.4462253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.4462946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.4463349Z 2025-05-07T20:32:59.4463718Z self = 2025-05-07T20:32:59.4465919Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.4468252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef59e0>} 2025-05-07T20:32:59.4470475Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.4472139Z context = 2025-05-07T20:32:59.4472621Z 2025-05-07T20:32:59.4472883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.4473991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.4474746Z module_map=module_map) 2025-05-07T20:32:59.4475429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.4476083Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.4476503Z E ^ 2025-05-07T20:32:59.4477248Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.4477972Z 2025-05-07T20:32:59.4478654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.4479521Z 2025-05-07T20:32:59.4479685Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.4480327Z self=, 2025-05-07T20:32:59.4480882Z T=1, 2025-05-07T20:32:59.4481184Z D=7168, 2025-05-07T20:32:59.4481518Z scale_ub=None, 2025-05-07T20:32:59.4481875Z contiguous=False, 2025-05-07T20:32:59.4482252Z compiled=True, 2025-05-07T20:32:59.4482596Z ) 2025-05-07T20:32:59.7144578Z self = 2025-05-07T20:32:59.7145473Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:59.7145912Z 2025-05-07T20:32:59.7146036Z @given( 2025-05-07T20:32:59.7146400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.7146916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.7147398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.7147939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.7148466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.7148893Z ) 2025-05-07T20:32:59.7149445Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.7150175Z def test_silu_mul_quant( 2025-05-07T20:32:59.7150558Z self, 2025-05-07T20:32:59.7150870Z T: int, 2025-05-07T20:32:59.7151179Z D: int, 2025-05-07T20:32:59.7151526Z scale_ub: Optional[float], 2025-05-07T20:32:59.7151952Z contiguous: bool, 2025-05-07T20:32:59.7152329Z compiled: bool, 2025-05-07T20:32:59.7152685Z ) -> None: 2025-05-07T20:32:59.7153022Z torch.manual_seed(2025) 2025-05-07T20:32:59.7153417Z 2025-05-07T20:32:59.7153855Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.7154409Z 2025-05-07T20:32:59.7154721Z x_sign = torch.sign(x) 2025-05-07T20:32:59.7155185Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.7155690Z x = x_sign * x_clamp 2025-05-07T20:32:59.7156169Z x0 = x[:, :D] 2025-05-07T20:32:59.7156528Z x1 = x[:, D:] 2025-05-07T20:32:59.7156854Z 2025-05-07T20:32:59.7157148Z if contiguous: 2025-05-07T20:32:59.7157567Z x0 = x0.contiguous() 2025-05-07T20:32:59.7157997Z x1 = x1.contiguous() 2025-05-07T20:32:59.7158387Z 2025-05-07T20:32:59.7158691Z if scale_ub is not None: 2025-05-07T20:32:59.7159115Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.7159656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.7160156Z ) 2025-05-07T20:32:59.7160460Z else: 2025-05-07T20:32:59.7160786Z scale_ub_tensor = None 2025-05-07T20:32:59.7161183Z 2025-05-07T20:32:59.7161546Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7162043Z op = silu_mul_quant 2025-05-07T20:32:59.7162427Z if compiled: 2025-05-07T20:32:59.7162826Z op = torch.compile(op) 2025-05-07T20:32:59.7163296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7163733Z 2025-05-07T20:32:59.7164051Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.7164961Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.7165848Z 2025-05-07T20:32:59.7166200Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7166672Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.7167215Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.7167684Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.7168217Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.7168691Z 2025-05-07T20:32:59.7169006Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.7169321Z 2025-05-07T20:32:59.7169488Z moe/activation_test.py:126: 2025-05-07T20:32:59.7169953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7170441Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.7170931Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.7172171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.7173393Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.7174246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.7175336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.7176494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.7177783Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.7179078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.7180191Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.7181243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.7182113Z fn() 2025-05-07T20:32:59.7182922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.7183835Z self.fn.run( 2025-05-07T20:32:59.7184635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.7185482Z kernel = self.compile( 2025-05-07T20:32:59.7186347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.7187403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.7188051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7188432Z 2025-05-07T20:32:59.7188771Z self = 2025-05-07T20:32:59.7190577Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.7192831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef6700>} 2025-05-07T20:32:59.7194992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.7196789Z context = 2025-05-07T20:32:59.7197299Z 2025-05-07T20:32:59.7197586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.7198493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.7199508Z module_map=module_map) 2025-05-07T20:32:59.7200210Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.7200818Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.7201332Z E ^ 2025-05-07T20:32:59.7202140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.7202941Z 2025-05-07T20:32:59.7203686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.7204603Z 2025-05-07T20:32:59.7204782Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.7205489Z self=, 2025-05-07T20:32:59.7206188Z T=1, 2025-05-07T20:32:59.7206503Z D=5120, 2025-05-07T20:32:59.7206812Z scale_ub=1200.0, 2025-05-07T20:32:59.7207175Z contiguous=False, 2025-05-07T20:32:59.7207540Z compiled=True, 2025-05-07T20:32:59.7207872Z ) 2025-05-07T20:32:59.8749987Z self = 2025-05-07T20:32:59.8750879Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.8751335Z 2025-05-07T20:32:59.8751460Z @given( 2025-05-07T20:32:59.8751833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.8752336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.8752836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.8753366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.8753904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.8754345Z ) 2025-05-07T20:32:59.8754901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.8755620Z def test_silu_mul_quant( 2025-05-07T20:32:59.8756090Z self, 2025-05-07T20:32:59.8756397Z T: int, 2025-05-07T20:32:59.8756705Z D: int, 2025-05-07T20:32:59.8757058Z scale_ub: Optional[float], 2025-05-07T20:32:59.8757500Z contiguous: bool, 2025-05-07T20:32:59.8757884Z compiled: bool, 2025-05-07T20:32:59.8758245Z ) -> None: 2025-05-07T20:32:59.8758598Z torch.manual_seed(2025) 2025-05-07T20:32:59.8759007Z 2025-05-07T20:32:59.8759443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.8760001Z 2025-05-07T20:32:59.8760313Z x_sign = torch.sign(x) 2025-05-07T20:32:59.8760776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.8761283Z x = x_sign * x_clamp 2025-05-07T20:32:59.8761686Z x0 = x[:, :D] 2025-05-07T20:32:59.8762039Z x1 = x[:, D:] 2025-05-07T20:32:59.8762366Z 2025-05-07T20:32:59.8762670Z if contiguous: 2025-05-07T20:32:59.8763047Z x0 = x0.contiguous() 2025-05-07T20:32:59.8763453Z x1 = x1.contiguous() 2025-05-07T20:32:59.8763843Z 2025-05-07T20:32:59.8764160Z if scale_ub is not None: 2025-05-07T20:32:59.8764595Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.8765137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.8765981Z ) 2025-05-07T20:32:59.8766282Z else: 2025-05-07T20:32:59.8766620Z scale_ub_tensor = None 2025-05-07T20:32:59.8767023Z 2025-05-07T20:32:59.8767393Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.8767926Z op = silu_mul_quant 2025-05-07T20:32:59.8768322Z if compiled: 2025-05-07T20:32:59.8768717Z op = torch.compile(op) 2025-05-07T20:32:59.8769193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8769626Z 2025-05-07T20:32:59.8769944Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.8770199Z 2025-05-07T20:32:59.8770357Z moe/activation_test.py:117: 2025-05-07T20:32:59.8770763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8771766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.8772208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8773096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.8774087Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.8775125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.8776224Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.8777150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.8778373Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.8779528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.8780455Z kernel = self.compile( 2025-05-07T20:32:59.8781416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.8782493Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.8783128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8783473Z 2025-05-07T20:32:59.8783813Z self = 2025-05-07T20:32:59.8785575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.8787830Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef7e20>} 2025-05-07T20:32:59.8790025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.8791739Z context = 2025-05-07T20:32:59.8792227Z 2025-05-07T20:32:59.8792483Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.8793321Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.8794130Z module_map=module_map) 2025-05-07T20:32:59.8794684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.8795226Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.8795590Z E ^ 2025-05-07T20:32:59.8796447Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:59.8798992Z 
2025-05-07T20:32:59.8799163Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:59.8799870Z     self=<...>,
2025-05-07T20:32:59.8800578Z     T=1,
2025-05-07T20:32:59.8800871Z     D=5120,
2025-05-07T20:32:59.8801195Z     scale_ub=1200.0,
2025-05-07T20:32:59.8801569Z     contiguous=False,
2025-05-07T20:32:59.8801936Z     compiled=False,
2025-05-07T20:32:59.8802287Z )
2025-05-07T20:32:59.8802829Z self = <...>
2025-05-07T20:32:59.8803672Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:59.8804148Z 
2025-05-07T20:32:59.8804277Z     @given(
2025-05-07T20:32:59.8804657Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:59.8805179Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:59.8805858Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:59.8806494Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:59.8807054Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:59.8807571Z     )
2025-05-07T20:32:59.8808191Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:59.8808906Z     def test_silu_mul_quant(
2025-05-07T20:32:59.8809281Z         self,
2025-05-07T20:32:59.8809598Z         T: int,
2025-05-07T20:32:59.8809913Z         D: int,
2025-05-07T20:32:59.8810256Z         scale_ub: Optional[float],
2025-05-07T20:32:59.8810714Z         contiguous: bool,
2025-05-07T20:32:59.8811112Z         compiled: bool,
2025-05-07T20:32:59.8811476Z     ) -> None:
2025-05-07T20:32:59.8811835Z         torch.manual_seed(2025)
2025-05-07T20:32:59.8812243Z 
2025-05-07T20:32:59.8812690Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:59.8813279Z 
2025-05-07T20:32:59.8813609Z         x_sign = torch.sign(x)
2025-05-07T20:32:59.8814094Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:59.8814615Z         x = x_sign * x_clamp
2025-05-07T20:32:59.8815015Z         x0 = x[:, :D]
2025-05-07T20:32:59.8815379Z         x1 = x[:, D:]
2025-05-07T20:32:59.8815733Z 
2025-05-07T20:32:59.8816035Z         if contiguous:
2025-05-07T20:32:59.8816411Z             x0 = x0.contiguous()
2025-05-07T20:32:59.8816850Z             x1 = x1.contiguous()
2025-05-07T20:32:59.8817261Z 
2025-05-07T20:32:59.8817578Z         if scale_ub is not None:
2025-05-07T20:32:59.8818045Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:59.8818614Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:59.8819145Z             )
2025-05-07T20:32:59.8819460Z         else:
2025-05-07T20:32:59.8819813Z             scale_ub_tensor = None
2025-05-07T20:32:59.8820239Z 
2025-05-07T20:32:59.8820620Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:59.8821167Z             op = silu_mul_quant
2025-05-07T20:32:59.8821580Z             if compiled:
2025-05-07T20:32:59.8821985Z                 op = torch.compile(op)
2025-05-07T20:32:59.8822483Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:59.8822953Z 
2025-05-07T20:32:59.8823269Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:59.8823554Z 
2025-05-07T20:32:59.8823719Z moe/activation_test.py:117: 
2025-05-07T20:32:59.8836420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:59.8836980Z moe/activation_test.py:115: in fn
2025-05-07T20:32:59.8837434Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:59.8838593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:59.8839707Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:59.8840563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:59.8841649Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:59.8842677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:59.8843480Z     kernel = self.compile(
2025-05-07T20:32:59.8844333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:59.8845352Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:59.8845966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:59.8846335Z 
2025-05-07T20:32:59.8846672Z self = <...>
2025-05-07T20:32:59.8848505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:59.8850799Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8b66666480>}
2025-05-07T20:32:59.8853218Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:59.8855045Z context = <...>
2025-05-07T20:32:59.8855540Z 
2025-05-07T20:32:59.8855836Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:59.8856740Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:59.8857599Z                            module_map=module_map)
2025-05-07T20:32:59.8858241Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:59.8858858Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:59.8859306Z E       ^
2025-05-07T20:32:59.8860116Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:59.8860921Z 
2025-05-07T20:32:59.8861670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
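[Triage note: every failure in this run is the same underlying issue. The _fbgemm_silu_mul_quant Triton kernel produces an fp8e4nv (float8 e4m3) value, and Triton only supports that dtype on NVIDIA GPUs with compute capability >= 8.9 (Ada/Hopper); on older architectures only fp8e4b15 and fp8e5 are available, so the kernel fails at compile time before any example can run, independent of T, D, scale_ub, contiguous, or compiled. A minimal sketch of a capability guard that would turn these hard failures into skips; the helper and class names are hypothetical, not FBGEMM's actual skip logic:]

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton emits fp8e4nv (e4m3) only for compute capability >= (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical guard around the fp8 activation tests.
@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8ActivationTest(unittest.TestCase):
    ...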
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.8860921Z 2025-05-07T20:32:59.8861670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8862582Z 2025-05-07T20:32:59.8862756Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8863477Z self=, 2025-05-07T20:32:59.8864179Z T=16384, 2025-05-07T20:32:59.8864494Z D=5120, 2025-05-07T20:32:59.8864823Z scale_ub=1200.0, 2025-05-07T20:32:59.8865206Z contiguous=False, 2025-05-07T20:32:59.8865929Z compiled=True, 2025-05-07T20:32:59.8866274Z ) 2025-05-07T20:32:59.9730900Z self = 2025-05-07T20:32:59.9731887Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.9732389Z 2025-05-07T20:32:59.9732521Z @given( 2025-05-07T20:32:59.9732911Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.9733448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.9733957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.9734521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.9735076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.9735568Z ) 2025-05-07T20:32:59.9736171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.9736928Z def test_silu_mul_quant( 2025-05-07T20:32:59.9737340Z self, 2025-05-07T20:32:59.9737667Z T: int, 2025-05-07T20:32:59.9737990Z D: int, 2025-05-07T20:32:59.9738358Z scale_ub: Optional[float], 2025-05-07T20:32:59.9738826Z contiguous: bool, 2025-05-07T20:32:59.9739244Z compiled: bool, 2025-05-07T20:32:59.9739624Z ) -> None: 2025-05-07T20:32:59.9739992Z torch.manual_seed(2025) 2025-05-07T20:32:59.9740409Z 2025-05-07T20:32:59.9740858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.9741453Z 2025-05-07T20:32:59.9741773Z x_sign = torch.sign(x) 2025-05-07T20:32:59.9742231Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.9742729Z x = x_sign * x_clamp 2025-05-07T20:32:59.9743117Z x0 = x[:, :D] 2025-05-07T20:32:59.9743463Z x1 = x[:, D:] 2025-05-07T20:32:59.9743793Z 2025-05-07T20:32:59.9744094Z if contiguous: 2025-05-07T20:32:59.9744466Z x0 = x0.contiguous() 2025-05-07T20:32:59.9744907Z x1 = x1.contiguous() 2025-05-07T20:32:59.9745324Z 2025-05-07T20:32:59.9745635Z if scale_ub is not None: 2025-05-07T20:32:59.9746095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.9747199Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.9747733Z ) 2025-05-07T20:32:59.9748058Z else: 2025-05-07T20:32:59.9748409Z scale_ub_tensor = None 2025-05-07T20:32:59.9748937Z 2025-05-07T20:32:59.9749316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.9749858Z op = silu_mul_quant 2025-05-07T20:32:59.9750271Z if compiled: 2025-05-07T20:32:59.9750672Z op = torch.compile(op) 2025-05-07T20:32:59.9751176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9751648Z 2025-05-07T20:32:59.9751961Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.9752253Z 2025-05-07T20:32:59.9752418Z moe/activation_test.py:117: 2025-05-07T20:32:59.9752925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9753486Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.9753972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9754963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.9756076Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.9757237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.9758514Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.9759462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.9760652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.9761824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.9762773Z kernel = self.compile( 2025-05-07T20:32:59.9763640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.9764512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.9765081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9765719Z 2025-05-07T20:32:59.9766043Z self = 2025-05-07T20:32:59.9767678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.9769911Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66667ce0>} 2025-05-07T20:32:59.9772176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.9773943Z context = 2025-05-07T20:32:59.9774447Z 2025-05-07T20:32:59.9774737Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.9775617Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.9776368Z module_map=module_map) 2025-05-07T20:32:59.9776933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.9777476Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.9777879Z E ^ 2025-05-07T20:32:59.9778624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.9779353Z 2025-05-07T20:32:59.9779995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.9780812Z 2025-05-07T20:32:59.9781318Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.9781975Z self=, 2025-05-07T20:32:59.9782619Z T=2048, 2025-05-07T20:32:59.9783006Z D=7168, 2025-05-07T20:32:59.9783279Z scale_ub=1200.0, 2025-05-07T20:32:59.9783609Z contiguous=False, 2025-05-07T20:32:59.9783949Z compiled=True, 2025-05-07T20:32:59.9784264Z ) 2025-05-07T20:32:59.9784749Z self = 2025-05-07T20:32:59.9785506Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.9785936Z 2025-05-07T20:32:59.9786054Z @given( 2025-05-07T20:32:59.9786403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.9786876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.9787341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.9787889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.9788415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.9788858Z ) 2025-05-07T20:32:59.9789389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.9790077Z def test_silu_mul_quant( 2025-05-07T20:32:59.9790444Z self, 2025-05-07T20:32:59.9790746Z T: int, 2025-05-07T20:32:59.9791049Z D: int, 2025-05-07T20:32:59.9791374Z scale_ub: Optional[float], 2025-05-07T20:32:59.9791797Z contiguous: bool, 2025-05-07T20:32:59.9792173Z compiled: bool, 2025-05-07T20:32:59.9792501Z ) -> None: 2025-05-07T20:32:59.9792830Z torch.manual_seed(2025) 2025-05-07T20:32:59.9793202Z 2025-05-07T20:32:59.9793609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.9794130Z 2025-05-07T20:32:59.9794424Z x_sign = torch.sign(x) 2025-05-07T20:32:59.9794872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.9795352Z x = x_sign * x_clamp 2025-05-07T20:32:59.9795716Z x0 = x[:, :D] 2025-05-07T20:32:59.9796150Z x1 = x[:, D:] 2025-05-07T20:32:59.9796454Z 2025-05-07T20:32:59.9796732Z if contiguous: 2025-05-07T20:32:59.9797090Z x0 = x0.contiguous() 2025-05-07T20:32:59.9797472Z x1 = x1.contiguous() 2025-05-07T20:32:59.9797870Z 2025-05-07T20:32:59.9798182Z if scale_ub is not None: 2025-05-07T20:32:59.9798598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.9799112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.9799592Z ) 2025-05-07T20:32:59.9799883Z else: 2025-05-07T20:32:59.9800206Z scale_ub_tensor = None 2025-05-07T20:32:59.9800600Z 2025-05-07T20:32:59.9800947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.9801454Z op = silu_mul_quant 2025-05-07T20:32:59.9801856Z if compiled: 2025-05-07T20:32:59.9802268Z op = torch.compile(op) 2025-05-07T20:32:59.9802767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9803229Z 2025-05-07T20:32:59.9803552Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.9803837Z 2025-05-07T20:32:59.9804001Z moe/activation_test.py:117: 2025-05-07T20:32:59.9804491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9805049Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.9805507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.9806457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.9807454Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.9808544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.9809664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.9810750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.9812049Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.9813265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.9814195Z kernel = self.compile( 2025-05-07T20:32:59.9815141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.9816277Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.9816950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.9817355Z 2025-05-07T20:32:59.9817745Z self = 2025-05-07T20:32:59.9819662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.9822123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b666679c0>} 2025-05-07T20:32:59.9824506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.9826309Z context = 2025-05-07T20:32:59.9826815Z 2025-05-07T20:32:59.9827092Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.9828050Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.9828854Z module_map=module_map) 2025-05-07T20:32:59.9829480Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.9830077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.9830515Z E ^ 2025-05-07T20:32:59.9831316Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.9832130Z 2025-05-07T20:32:59.9832857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.9833768Z 2025-05-07T20:33:00.1003656Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1004459Z self=, 2025-05-07T20:33:00.1005159Z T=1, 2025-05-07T20:33:00.1005475Z D=5120, 2025-05-07T20:33:00.1005802Z scale_ub=None, 2025-05-07T20:33:00.1006155Z contiguous=False, 2025-05-07T20:33:00.1006532Z compiled=False, 2025-05-07T20:33:00.1006878Z ) 2025-05-07T20:33:00.1007444Z self = 2025-05-07T20:33:00.1008287Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.1008755Z 2025-05-07T20:33:00.1008883Z @given( 2025-05-07T20:33:00.1009281Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1009804Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1010335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1010906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1011465Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1011960Z ) 2025-05-07T20:33:00.1012570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1013345Z def test_silu_mul_quant( 2025-05-07T20:33:00.1013748Z self, 2025-05-07T20:33:00.1014076Z T: int, 2025-05-07T20:33:00.1014404Z D: int, 2025-05-07T20:33:00.1014765Z scale_ub: Optional[float], 2025-05-07T20:33:00.1015657Z contiguous: bool, 2025-05-07T20:33:00.1016059Z compiled: bool, 2025-05-07T20:33:00.1016406Z ) -> None: 2025-05-07T20:33:00.1016756Z torch.manual_seed(2025) 2025-05-07T20:33:00.1017281Z 2025-05-07T20:33:00.1017739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1018359Z 2025-05-07T20:33:00.1018681Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1019157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1019685Z x = x_sign * x_clamp 2025-05-07T20:33:00.1020086Z x0 = x[:, :D] 2025-05-07T20:33:00.1020435Z x1 = x[:, D:] 2025-05-07T20:33:00.1020777Z 2025-05-07T20:33:00.1021084Z if contiguous: 2025-05-07T20:33:00.1021462Z x0 = x0.contiguous() 2025-05-07T20:33:00.1021897Z x1 = x1.contiguous() 2025-05-07T20:33:00.1022292Z 2025-05-07T20:33:00.1022610Z if scale_ub is not None: 2025-05-07T20:33:00.1023089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.1023654Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.1024181Z ) 2025-05-07T20:33:00.1024492Z else: 2025-05-07T20:33:00.1024847Z scale_ub_tensor = None 2025-05-07T20:33:00.1025271Z 2025-05-07T20:33:00.1025657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.1026203Z op = silu_mul_quant 2025-05-07T20:33:00.1026629Z if compiled: 2025-05-07T20:33:00.1027041Z op = torch.compile(op) 2025-05-07T20:33:00.1027545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1028021Z 2025-05-07T20:33:00.1028331Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.1028622Z 2025-05-07T20:33:00.1028793Z moe/activation_test.py:117: 2025-05-07T20:33:00.1029296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1029873Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.1030354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1031571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.1032798Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.1033733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.1034943Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.1036217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.1036986Z kernel = self.compile( 2025-05-07T20:33:00.1037704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.1038604Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.1039168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1039490Z 2025-05-07T20:33:00.1039763Z self = 2025-05-07T20:33:00.1041240Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.1043233Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b668dd800>} 2025-05-07T20:33:00.1045153Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.1046653Z context = 2025-05-07T20:33:00.1047251Z 2025-05-07T20:33:00.1047496Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.1048304Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.1049079Z module_map=module_map) 2025-05-07T20:33:00.1049638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.1050146Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.1050561Z E ^ 2025-05-07T20:33:00.1051299Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.1052016Z 2025-05-07T20:33:00.1052674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.1053510Z 2025-05-07T20:33:00.1053672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1054368Z self=, 2025-05-07T20:33:00.1055073Z T=4096, 2025-05-07T20:33:00.1055376Z D=7168, 2025-05-07T20:33:00.1055699Z scale_ub=1200.0, 2025-05-07T20:33:00.1056067Z contiguous=False, 2025-05-07T20:33:00.1056456Z compiled=False, 2025-05-07T20:33:00.1056798Z ) 2025-05-07T20:33:00.1057335Z self = 2025-05-07T20:33:00.1058238Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.1058719Z 2025-05-07T20:33:00.1058851Z @given( 2025-05-07T20:33:00.1059233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1059766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1060282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1060842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1061406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1061896Z ) 2025-05-07T20:33:00.1062495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1063263Z def test_silu_mul_quant( 2025-05-07T20:33:00.1063669Z self, 2025-05-07T20:33:00.1063991Z T: int, 2025-05-07T20:33:00.1064319Z D: int, 2025-05-07T20:33:00.1064680Z scale_ub: Optional[float], 2025-05-07T20:33:00.1065130Z contiguous: bool, 2025-05-07T20:33:00.1065916Z compiled: bool, 2025-05-07T20:33:00.1066297Z ) -> None: 2025-05-07T20:33:00.1066653Z torch.manual_seed(2025) 2025-05-07T20:33:00.1067063Z 2025-05-07T20:33:00.1067515Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1068090Z 2025-05-07T20:33:00.1068410Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1068899Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1069420Z x = x_sign * x_clamp 2025-05-07T20:33:00.1069819Z x0 = x[:, :D] 2025-05-07T20:33:00.1070182Z x1 = x[:, D:] 2025-05-07T20:33:00.1070540Z 2025-05-07T20:33:00.1070841Z if contiguous: 2025-05-07T20:33:00.1071230Z x0 = x0.contiguous() 2025-05-07T20:33:00.1071666Z x1 = x1.contiguous() 2025-05-07T20:33:00.1072073Z 2025-05-07T20:33:00.1072393Z if scale_ub is not None: 2025-05-07T20:33:00.1072856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.1073413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.1073938Z ) 2025-05-07T20:33:00.1074263Z else: 2025-05-07T20:33:00.1074608Z scale_ub_tensor = None 2025-05-07T20:33:00.1075037Z 2025-05-07T20:33:00.1075423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.1076048Z op = silu_mul_quant 2025-05-07T20:33:00.1076473Z if compiled: 2025-05-07T20:33:00.1076888Z op = torch.compile(op) 2025-05-07T20:33:00.1077378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1078199Z 2025-05-07T20:33:00.1078544Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.1078827Z 2025-05-07T20:33:00.1079003Z moe/activation_test.py:117: 2025-05-07T20:33:00.1079493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1080155Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.1080628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1081830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.1083043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.1083978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.1085175Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.1086330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.1087272Z kernel = self.compile( 2025-05-07T20:33:00.1088264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.1089408Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.1090083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1090490Z 2025-05-07T20:33:00.1090833Z self = 2025-05-07T20:33:00.1092742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.1095195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66e85300>} 2025-05-07T20:33:00.1097601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.1099417Z context = 2025-05-07T20:33:00.1099923Z 2025-05-07T20:33:00.1100209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.1101118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.1101928Z module_map=module_map) 2025-05-07T20:33:00.1102548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.1103172Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.1103612Z E ^ 2025-05-07T20:33:00.1104415Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.1105228Z 2025-05-07T20:33:00.1105968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.1106879Z 2025-05-07T20:33:00.1107069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1107775Z self=, 2025-05-07T20:33:00.1108526Z T=16384, 2025-05-07T20:33:00.1108850Z D=7168, 2025-05-07T20:33:00.1109165Z scale_ub=None, 2025-05-07T20:33:00.1109526Z contiguous=True, 2025-05-07T20:33:00.1109899Z compiled=True, 2025-05-07T20:33:00.1110234Z ) 2025-05-07T20:33:00.2884838Z self = 2025-05-07T20:33:00.2885782Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.2886265Z 2025-05-07T20:33:00.2886394Z @given( 2025-05-07T20:33:00.2886778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2887881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2888408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2901786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2902544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2903037Z ) 2025-05-07T20:33:00.2903629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2904400Z def test_silu_mul_quant( 2025-05-07T20:33:00.2904804Z self, 2025-05-07T20:33:00.2905134Z T: int, 2025-05-07T20:33:00.2905469Z D: int, 2025-05-07T20:33:00.2905828Z scale_ub: Optional[float], 2025-05-07T20:33:00.2906294Z contiguous: bool, 2025-05-07T20:33:00.2906704Z compiled: bool, 2025-05-07T20:33:00.2907084Z ) -> None: 2025-05-07T20:33:00.2907439Z torch.manual_seed(2025) 2025-05-07T20:33:00.2907855Z 2025-05-07T20:33:00.2908320Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2908921Z 2025-05-07T20:33:00.2909244Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2909738Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2910269Z x = x_sign * x_clamp 2025-05-07T20:33:00.2910675Z x0 = x[:, :D] 2025-05-07T20:33:00.2911042Z x1 = x[:, D:] 2025-05-07T20:33:00.2911379Z 2025-05-07T20:33:00.2911688Z if contiguous: 2025-05-07T20:33:00.2912080Z x0 = x0.contiguous() 2025-05-07T20:33:00.2912508Z x1 = x1.contiguous() 2025-05-07T20:33:00.2912919Z 2025-05-07T20:33:00.2913239Z if scale_ub is not None: 2025-05-07T20:33:00.2913697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2914270Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2914797Z ) 2025-05-07T20:33:00.2915112Z else: 2025-05-07T20:33:00.2915468Z scale_ub_tensor = None 2025-05-07T20:33:00.2916028Z 2025-05-07T20:33:00.2916426Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2916883Z op = silu_mul_quant 2025-05-07T20:33:00.2917203Z if compiled: 2025-05-07T20:33:00.2917527Z op = torch.compile(op) 2025-05-07T20:33:00.2917969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2918362Z 2025-05-07T20:33:00.2918637Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2918857Z 2025-05-07T20:33:00.2918993Z moe/activation_test.py:117: 2025-05-07T20:33:00.2919392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2919860Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2920243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2920998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.2921803Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.2922770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2923727Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2924521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2925504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2926473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2927237Z kernel = self.compile( 2025-05-07T20:33:00.2928094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2929131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2929760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2930136Z 2025-05-07T20:33:00.2930601Z self = 2025-05-07T20:33:00.2932410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2934681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90a65440>} 2025-05-07T20:33:00.2936834Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2938406Z context = 2025-05-07T20:33:00.2938864Z 2025-05-07T20:33:00.2939120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2939943Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2940703Z module_map=module_map) 2025-05-07T20:33:00.2941290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2941832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2942251Z E ^ 2025-05-07T20:33:00.2943030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2943808Z 2025-05-07T20:33:00.2944536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.2945441Z 2025-05-07T20:33:00.2945613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2946309Z self=, 2025-05-07T20:33:00.2946937Z T=4096, 2025-05-07T20:33:00.2947223Z D=5120, 2025-05-07T20:33:00.2947542Z scale_ub=None, 2025-05-07T20:33:00.2947885Z contiguous=False, 2025-05-07T20:33:00.2948238Z compiled=True, 2025-05-07T20:33:00.2948553Z ) 2025-05-07T20:33:00.2949060Z self = 2025-05-07T20:33:00.2949849Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.2950285Z 2025-05-07T20:33:00.2950404Z @given( 2025-05-07T20:33:00.2950723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2951183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2951634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2952142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2952636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2953070Z ) 2025-05-07T20:33:00.2953609Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2954274Z def test_silu_mul_quant( 2025-05-07T20:33:00.2954656Z self, 2025-05-07T20:33:00.2954948Z T: int, 2025-05-07T20:33:00.2955230Z D: int, 2025-05-07T20:33:00.2955553Z scale_ub: Optional[float], 2025-05-07T20:33:00.2956089Z contiguous: bool, 2025-05-07T20:33:00.2956453Z compiled: bool, 2025-05-07T20:33:00.2956804Z ) -> None: 2025-05-07T20:33:00.2957128Z torch.manual_seed(2025) 2025-05-07T20:33:00.2957496Z 2025-05-07T20:33:00.2957901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2958435Z 2025-05-07T20:33:00.2958738Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2959177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2959667Z x = x_sign * x_clamp 2025-05-07T20:33:00.2960044Z x0 = x[:, :D] 2025-05-07T20:33:00.2960367Z x1 = x[:, D:] 2025-05-07T20:33:00.2960691Z 2025-05-07T20:33:00.2960981Z if contiguous: 2025-05-07T20:33:00.2961483Z x0 = x0.contiguous() 2025-05-07T20:33:00.2961939Z x1 = x1.contiguous() 2025-05-07T20:33:00.2962302Z 2025-05-07T20:33:00.2962587Z if scale_ub is not None: 2025-05-07T20:33:00.2963010Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2963592Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2964063Z ) 2025-05-07T20:33:00.2964357Z else: 2025-05-07T20:33:00.2964677Z scale_ub_tensor = None 2025-05-07T20:33:00.2965051Z 2025-05-07T20:33:00.2965782Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2966243Z op = silu_mul_quant 2025-05-07T20:33:00.2966580Z if compiled: 2025-05-07T20:33:00.2966909Z op = torch.compile(op) 2025-05-07T20:33:00.2967289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2967650Z 2025-05-07T20:33:00.2967854Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2968022Z 2025-05-07T20:33:00.2968144Z moe/activation_test.py:117: 2025-05-07T20:33:00.2968436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2968774Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2969067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2969624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.2970192Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.2970848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2971533Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2972063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2972746Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2973421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2973951Z kernel = self.compile( 2025-05-07T20:33:00.2974496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2975157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2975565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2975796Z 2025-05-07T20:33:00.2976002Z self = 2025-05-07T20:33:00.2977088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2978523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90283060>} 2025-05-07T20:33:00.2979867Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2980896Z context = 2025-05-07T20:33:00.2981190Z 2025-05-07T20:33:00.2981361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2981894Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2982373Z module_map=module_map) 2025-05-07T20:33:00.2982741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2983103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2983369Z E ^ 2025-05-07T20:33:00.2984020Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2984534Z 2025-05-07T20:33:00.2984948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.2985522Z 2025-05-07T20:33:00.4453885Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4454331Z self=, 2025-05-07T20:33:00.4454774Z T=4096, 2025-05-07T20:33:00.4454969Z D=5120, 2025-05-07T20:33:00.4455162Z scale_ub=1200.0, 2025-05-07T20:33:00.4455384Z contiguous=False, 2025-05-07T20:33:00.4455613Z compiled=False, 2025-05-07T20:33:00.4455825Z ) 2025-05-07T20:33:00.4456142Z self = 2025-05-07T20:33:00.4456645Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.4456922Z 2025-05-07T20:33:00.4457009Z @given( 2025-05-07T20:33:00.4457260Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4457585Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4457899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4458240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4458569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4458857Z ) 2025-05-07T20:33:00.4459206Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4459647Z def test_silu_mul_quant( 2025-05-07T20:33:00.4459891Z self, 2025-05-07T20:33:00.4460093Z T: int, 2025-05-07T20:33:00.4460287Z D: int, 2025-05-07T20:33:00.4460510Z scale_ub: Optional[float], 2025-05-07T20:33:00.4460787Z contiguous: bool, 2025-05-07T20:33:00.4461024Z compiled: bool, 2025-05-07T20:33:00.4461262Z ) -> None: 2025-05-07T20:33:00.4461491Z torch.manual_seed(2025) 2025-05-07T20:33:00.4461734Z 2025-05-07T20:33:00.4462018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4462368Z 2025-05-07T20:33:00.4462559Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4462860Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4463183Z x = x_sign * x_clamp 2025-05-07T20:33:00.4463432Z x0 = x[:, :D] 2025-05-07T20:33:00.4463646Z x1 = x[:, D:] 2025-05-07T20:33:00.4463857Z 2025-05-07T20:33:00.4464049Z if contiguous: 2025-05-07T20:33:00.4464278Z x0 = x0.contiguous() 2025-05-07T20:33:00.4464544Z x1 = x1.contiguous() 2025-05-07T20:33:00.4464788Z 2025-05-07T20:33:00.4464978Z if scale_ub is not None: 2025-05-07T20:33:00.4465263Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4465844Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4466152Z ) 2025-05-07T20:33:00.4466351Z else: 2025-05-07T20:33:00.4466571Z scale_ub_tensor = None 2025-05-07T20:33:00.4466821Z 2025-05-07T20:33:00.4467057Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4467376Z op = silu_mul_quant 2025-05-07T20:33:00.4467625Z if compiled: 2025-05-07T20:33:00.4467875Z op = torch.compile(op) 2025-05-07T20:33:00.4468170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4468449Z 2025-05-07T20:33:00.4468640Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4468811Z 2025-05-07T20:33:00.4468913Z moe/activation_test.py:117: 2025-05-07T20:33:00.4469210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4469544Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4469828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4470513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.4471553Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4472163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4472846Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4473595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4474122Z kernel = self.compile( 2025-05-07T20:33:00.4474662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4475316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4475723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4476043Z 2025-05-07T20:33:00.4476247Z self = 2025-05-07T20:33:00.4477331Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4478723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b661b1b20>} 2025-05-07T20:33:00.4480061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4481076Z context = 2025-05-07T20:33:00.4481374Z 2025-05-07T20:33:00.4481542Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4482069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4482550Z module_map=module_map) 2025-05-07T20:33:00.4482911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4483277Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4483543Z E ^ 2025-05-07T20:33:00.4484003Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4484463Z 2025-05-07T20:33:00.4484880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4485395Z 2025-05-07T20:33:00.4485499Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4485922Z self=, 2025-05-07T20:33:00.4486325Z T=4096, 2025-05-07T20:33:00.4486520Z D=5120, 2025-05-07T20:33:00.4486718Z scale_ub=1200.0, 2025-05-07T20:33:00.4486953Z contiguous=False, 2025-05-07T20:33:00.4487185Z compiled=True, 2025-05-07T20:33:00.4487402Z ) 2025-05-07T20:33:00.4487714Z self = 2025-05-07T20:33:00.4488211Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.4488493Z 2025-05-07T20:33:00.4488575Z @given( 2025-05-07T20:33:00.4488811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4489129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4489443Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4489773Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4490098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4490385Z ) 2025-05-07T20:33:00.4490732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4491167Z def test_silu_mul_quant( 2025-05-07T20:33:00.4491412Z self, 2025-05-07T20:33:00.4491608Z T: int, 2025-05-07T20:33:00.4491933Z D: int, 2025-05-07T20:33:00.4492155Z scale_ub: Optional[float], 2025-05-07T20:33:00.4492430Z contiguous: bool, 2025-05-07T20:33:00.4492675Z compiled: bool, 2025-05-07T20:33:00.4492934Z ) -> None: 2025-05-07T20:33:00.4493154Z torch.manual_seed(2025) 2025-05-07T20:33:00.4493397Z 2025-05-07T20:33:00.4493663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4494009Z 2025-05-07T20:33:00.4494203Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4494488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4494799Z x = x_sign * x_clamp 2025-05-07T20:33:00.4495044Z x0 = x[:, :D] 2025-05-07T20:33:00.4495258Z x1 = x[:, D:] 2025-05-07T20:33:00.4495470Z 2025-05-07T20:33:00.4495659Z if contiguous: 2025-05-07T20:33:00.4495886Z x0 = x0.contiguous() 2025-05-07T20:33:00.4496146Z x1 = x1.contiguous() 2025-05-07T20:33:00.4496394Z 2025-05-07T20:33:00.4496593Z if scale_ub is not None: 2025-05-07T20:33:00.4496870Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4497209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4497524Z ) 2025-05-07T20:33:00.4497717Z else: 2025-05-07T20:33:00.4497932Z scale_ub_tensor = None 2025-05-07T20:33:00.4498188Z 2025-05-07T20:33:00.4498421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4498740Z op = silu_mul_quant 2025-05-07T20:33:00.4498993Z if compiled: 2025-05-07T20:33:00.4499236Z op = torch.compile(op) 2025-05-07T20:33:00.4499534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4499812Z 2025-05-07T20:33:00.4500004Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4500178Z 2025-05-07T20:33:00.4500277Z moe/activation_test.py:117: 2025-05-07T20:33:00.4500577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4500923Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4501202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4501761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.4502323Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.4502981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.4503659Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4504208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4504885Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4505544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4506067Z kernel = self.compile( 2025-05-07T20:33:00.4506610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4507265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4507660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4507895Z 2025-05-07T20:33:00.4508100Z self = 2025-05-07T20:33:00.4509177Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4510551Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b661b3f60>} 2025-05-07T20:33:00.4511971Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4513029Z context = 2025-05-07T20:33:00.4513362Z 2025-05-07T20:33:00.4513529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4514058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4514531Z module_map=module_map) 2025-05-07T20:33:00.4514890Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4515248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4515513Z E ^ 2025-05-07T20:33:00.4516081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4516540Z 2025-05-07T20:33:00.4516956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4517471Z 2025-05-07T20:33:00.5678300Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5678742Z self=, 2025-05-07T20:33:00.5679147Z T=2048, 2025-05-07T20:33:00.5679341Z D=7168, 2025-05-07T20:33:00.5679538Z scale_ub=1200.0, 2025-05-07T20:33:00.5679763Z contiguous=False, 2025-05-07T20:33:00.5679992Z compiled=False, 2025-05-07T20:33:00.5680202Z ) 2025-05-07T20:33:00.5680520Z self = 2025-05-07T20:33:00.5681020Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.5681307Z 2025-05-07T20:33:00.5681387Z @given( 2025-05-07T20:33:00.5681627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5681942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5682273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5682610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5682938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5683228Z ) 2025-05-07T20:33:00.5683578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5684023Z def test_silu_mul_quant( 2025-05-07T20:33:00.5684264Z self, 2025-05-07T20:33:00.5684469Z T: int, 2025-05-07T20:33:00.5684672Z D: int, 2025-05-07T20:33:00.5684891Z scale_ub: Optional[float], 2025-05-07T20:33:00.5685187Z contiguous: bool, 2025-05-07T20:33:00.5685435Z compiled: bool, 2025-05-07T20:33:00.5685662Z ) -> None: 2025-05-07T20:33:00.5685882Z torch.manual_seed(2025) 2025-05-07T20:33:00.5686127Z 2025-05-07T20:33:00.5686402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5686758Z 2025-05-07T20:33:00.5686961Z x_sign = torch.sign(x) 2025-05-07T20:33:00.5687276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.5687594Z x = x_sign * x_clamp 2025-05-07T20:33:00.5687837Z x0 = x[:, :D] 2025-05-07T20:33:00.5688061Z x1 = x[:, D:] 2025-05-07T20:33:00.5688276Z 2025-05-07T20:33:00.5688467Z if contiguous: 2025-05-07T20:33:00.5688703Z x0 = x0.contiguous() 2025-05-07T20:33:00.5688977Z x1 = x1.contiguous() 2025-05-07T20:33:00.5689228Z 2025-05-07T20:33:00.5689428Z if scale_ub is not None: 2025-05-07T20:33:00.5689715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.5690048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.5690372Z ) 2025-05-07T20:33:00.5690576Z else: 2025-05-07T20:33:00.5690798Z scale_ub_tensor = None 2025-05-07T20:33:00.5691048Z 2025-05-07T20:33:00.5691287Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.5691891Z op = silu_mul_quant 2025-05-07T20:33:00.5692142Z if compiled: 2025-05-07T20:33:00.5692397Z op = torch.compile(op) 2025-05-07T20:33:00.5701955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5702277Z 2025-05-07T20:33:00.5702487Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.5702654Z 2025-05-07T20:33:00.5702769Z moe/activation_test.py:117: 2025-05-07T20:33:00.5703071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5703405Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.5703694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5704396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:00.5705083Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.5705633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.5706321Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.5706987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.5707518Z kernel = self.compile( 2025-05-07T20:33:00.5708064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.5708722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.5709121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5709359Z 2025-05-07T20:33:00.5709568Z self = 2025-05-07T20:33:00.5710656Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.5712052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b661b1440>} 2025-05-07T20:33:00.5713404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.5714423Z context = 2025-05-07T20:33:00.5714720Z 2025-05-07T20:33:00.5714888Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.5715421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.5715988Z module_map=module_map) 2025-05-07T20:33:00.5716360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.5716723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.5716987Z E ^ 2025-05-07T20:33:00.5717446Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.5717908Z 2025-05-07T20:33:00.5718319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.5718840Z 2025-05-07T20:33:00.5718949Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5719366Z self=, 2025-05-07T20:33:00.5719769Z T=1, 2025-05-07T20:33:00.5719965Z D=7168, 2025-05-07T20:33:00.5720166Z scale_ub=None, 2025-05-07T20:33:00.5720378Z contiguous=True, 2025-05-07T20:33:00.5720607Z compiled=False, 2025-05-07T20:33:00.5720818Z ) 2025-05-07T20:33:00.5721135Z self = 2025-05-07T20:33:00.5721793Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.5722054Z 2025-05-07T20:33:00.5722143Z @given( 2025-05-07T20:33:00.5722380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5722735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5723046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5723378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5723702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5723995Z ) 2025-05-07T20:33:00.5724352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5724789Z def test_silu_mul_quant( 2025-05-07T20:33:00.5725039Z self, 2025-05-07T20:33:00.5725241Z T: int, 2025-05-07T20:33:00.5725437Z D: int, 2025-05-07T20:33:00.5725668Z scale_ub: Optional[float], 2025-05-07T20:33:00.5725949Z contiguous: bool, 2025-05-07T20:33:00.5726197Z compiled: bool, 2025-05-07T20:33:00.5726438Z ) -> None: 2025-05-07T20:33:00.5726667Z torch.manual_seed(2025) 2025-05-07T20:33:00.5726910Z 2025-05-07T20:33:00.5727199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5727550Z 2025-05-07T20:33:00.5727757Z x_sign = torch.sign(x) 2025-05-07T20:33:00.5728050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.5728374Z x = x_sign * x_clamp 2025-05-07T20:33:00.5728624Z x0 = x[:, :D] 2025-05-07T20:33:00.5728840Z x1 = x[:, D:] 2025-05-07T20:33:00.5729058Z 2025-05-07T20:33:00.5729254Z if contiguous: 2025-05-07T20:33:00.5729490Z x0 = x0.contiguous() 2025-05-07T20:33:00.5729757Z x1 = x1.contiguous() 2025-05-07T20:33:00.5730006Z 2025-05-07T20:33:00.5730201Z if scale_ub is not None: 2025-05-07T20:33:00.5730491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.5730838Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.5731146Z ) 2025-05-07T20:33:00.5731350Z else: 2025-05-07T20:33:00.5731575Z scale_ub_tensor = None 2025-05-07T20:33:00.5731831Z 2025-05-07T20:33:00.5732068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.5732398Z op = silu_mul_quant 2025-05-07T20:33:00.5732659Z if compiled: 2025-05-07T20:33:00.5732908Z op = torch.compile(op) 2025-05-07T20:33:00.5733218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5733501Z 2025-05-07T20:33:00.5733697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.5733869Z 2025-05-07T20:33:00.5733971Z moe/activation_test.py:117: 2025-05-07T20:33:00.5734279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5734613Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.5734901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.5735611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.5736307Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.5736844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.5737535Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.5738204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.5738734Z kernel = self.compile( 2025-05-07T20:33:00.5739279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.5739937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.5740428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.5740700Z 2025-05-07T20:33:00.5740906Z self = 2025-05-07T20:33:00.5741996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.5743411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66837740>} 2025-05-07T20:33:00.5744764Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.5745800Z context = 2025-05-07T20:33:00.5746093Z 2025-05-07T20:33:00.5746274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.5746808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.5747291Z module_map=module_map) 2025-05-07T20:33:00.5747663Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.5748058Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.5748355Z E ^ 2025-05-07T20:33:00.5748824Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
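Every failure in this stretch of the log is the same failure. The runner is a g5.4xlarge, whose A10G GPU reports compute capability sm_86, and Triton's fp8e4nv dtype (its name for the float8_e4m3fn format that silu_mul_quant quantizes to) is generally only available on sm_89-class (Ada) or newer GPUs; on this card only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. Stripped of the Hypothesis harness, the failing call reduces to the sketch below; the module path, call signature, and shapes are taken from the log itself, the rest is illustrative:

    # Minimal repro distilled from the failing test. Assumes a CUDA build of
    # fbgemm_gpu with the experimental gen_ai ops installed.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 1, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On sm_89+ GPUs this returns an fp8 tensor and its scales; on an sm_86
    # A10G, Triton raises CompilationError while building _fbgemm_silu_mul_quant.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # the scale_ub tensor is optional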
Hypothesis went on to try eleven more parameter combinations, and every one failed at the same point with the same CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The test source and traceback were identical to the first example above; runs with compiled=True additionally passed through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching activation.py:80. The examples tried:

    T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
    T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True
    T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True
    T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8344405Z 2025-05-07T20:33:01.8344826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8345335Z 2025-05-07T20:33:01.8345453Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8345863Z self=, 2025-05-07T20:33:01.8346271Z T=16384, 2025-05-07T20:33:01.8346475Z D=7168, 2025-05-07T20:33:01.8346678Z scale_ub=1200.0, 2025-05-07T20:33:01.8346901Z contiguous=True, 2025-05-07T20:33:01.8347133Z compiled=True, 2025-05-07T20:33:01.8347353Z ) 2025-05-07T20:33:01.8347669Z self = 2025-05-07T20:33:01.8348174Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.8348455Z 2025-05-07T20:33:01.8348543Z @given( 2025-05-07T20:33:01.8348786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8349104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8349413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8349745Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8350083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8350375Z ) 2025-05-07T20:33:01.8350730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8351167Z def test_silu_mul_quant( 2025-05-07T20:33:01.8351416Z self, 2025-05-07T20:33:01.8351617Z T: int, 2025-05-07T20:33:01.8351815Z D: int, 2025-05-07T20:33:01.8352040Z scale_ub: Optional[float], 2025-05-07T20:33:01.8352317Z contiguous: bool, 2025-05-07T20:33:01.8352554Z compiled: bool, 2025-05-07T20:33:01.8352788Z ) -> None: 2025-05-07T20:33:01.8353009Z torch.manual_seed(2025) 2025-05-07T20:33:01.8353253Z 2025-05-07T20:33:01.8353530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8353885Z 2025-05-07T20:33:01.8354084Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8354381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8354704Z x = x_sign * x_clamp 2025-05-07T20:33:01.8354953Z x0 = x[:, :D] 2025-05-07T20:33:01.8355173Z x1 = x[:, D:] 2025-05-07T20:33:01.8355392Z 2025-05-07T20:33:01.8355590Z if contiguous: 2025-05-07T20:33:01.8355883Z x0 = x0.contiguous() 2025-05-07T20:33:01.8356155Z x1 = x1.contiguous() 2025-05-07T20:33:01.8356406Z 2025-05-07T20:33:01.8356602Z if scale_ub is not None: 2025-05-07T20:33:01.8356881Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8357223Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8357533Z ) 2025-05-07T20:33:01.8357731Z else: 2025-05-07T20:33:01.8358479Z scale_ub_tensor = None 2025-05-07T20:33:01.8358756Z 2025-05-07T20:33:01.8359029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8359356Z op = silu_mul_quant 2025-05-07T20:33:01.8359647Z if compiled: 2025-05-07T20:33:01.8359907Z op = torch.compile(op) 2025-05-07T20:33:01.8360213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8360492Z 2025-05-07T20:33:01.8360700Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8360875Z 2025-05-07T20:33:01.8360979Z moe/activation_test.py:117: 2025-05-07T20:33:01.8361276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8361604Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8361885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8362445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.8363012Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.8363669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8364362Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8364900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8365857Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8366520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8367058Z kernel = self.compile( 2025-05-07T20:33:01.8367590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8368244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8368683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8368939Z 2025-05-07T20:33:01.8369151Z self = 2025-05-07T20:33:01.8370221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8371621Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948e0c0>} 2025-05-07T20:33:01.8372974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8374002Z context = 2025-05-07T20:33:01.8374297Z 2025-05-07T20:33:01.8374484Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8375016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8375504Z module_map=module_map) 2025-05-07T20:33:01.8375884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8376248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8376524Z E ^ 2025-05-07T20:33:01.8376998Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8377451Z 2025-05-07T20:33:01.8377881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8378398Z 2025-05-07T20:33:01.9622833Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9623485Z self=, 2025-05-07T20:33:01.9624537Z T=16384, 2025-05-07T20:33:01.9624806Z D=5120, 2025-05-07T20:33:01.9625054Z scale_ub=1200.0, 2025-05-07T20:33:01.9625351Z contiguous=True, 2025-05-07T20:33:01.9625644Z compiled=False, 2025-05-07T20:33:01.9625938Z ) 2025-05-07T20:33:01.9635464Z self = 2025-05-07T20:33:01.9636034Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.9636324Z 2025-05-07T20:33:01.9636406Z @given( 2025-05-07T20:33:01.9636648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9636962Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9637272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9637610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9637943Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9638228Z ) 2025-05-07T20:33:01.9638595Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9639045Z def test_silu_mul_quant( 2025-05-07T20:33:01.9639322Z self, 2025-05-07T20:33:01.9639527Z T: int, 2025-05-07T20:33:01.9639726Z D: int, 2025-05-07T20:33:01.9639953Z scale_ub: Optional[float], 2025-05-07T20:33:01.9640233Z contiguous: bool, 2025-05-07T20:33:01.9640470Z compiled: bool, 2025-05-07T20:33:01.9640706Z ) -> None: 2025-05-07T20:33:01.9640932Z torch.manual_seed(2025) 2025-05-07T20:33:01.9641173Z 2025-05-07T20:33:01.9641451Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9641808Z 2025-05-07T20:33:01.9642004Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9642306Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9642620Z x = x_sign * x_clamp 2025-05-07T20:33:01.9642864Z x0 = x[:, :D] 2025-05-07T20:33:01.9643089Z x1 = x[:, D:] 2025-05-07T20:33:01.9643310Z 2025-05-07T20:33:01.9643494Z if contiguous: 2025-05-07T20:33:01.9643732Z x0 = x0.contiguous() 2025-05-07T20:33:01.9643998Z x1 = x1.contiguous() 2025-05-07T20:33:01.9644235Z 2025-05-07T20:33:01.9644433Z if scale_ub is not None: 2025-05-07T20:33:01.9644710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9645049Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9645355Z ) 2025-05-07T20:33:01.9645551Z else: 2025-05-07T20:33:01.9645765Z scale_ub_tensor = None 2025-05-07T20:33:01.9646013Z 2025-05-07T20:33:01.9646251Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9646574Z op = silu_mul_quant 2025-05-07T20:33:01.9646822Z if compiled: 2025-05-07T20:33:01.9647073Z op = torch.compile(op) 2025-05-07T20:33:01.9647370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9647640Z 2025-05-07T20:33:01.9647846Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9648008Z 2025-05-07T20:33:01.9648115Z moe/activation_test.py:117: 2025-05-07T20:33:01.9648407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9648750Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9649040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9649739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.9650424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9650964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9651655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9652330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9653019Z kernel = self.compile( 2025-05-07T20:33:01.9653566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9654223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9654664Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9654906Z 2025-05-07T20:33:01.9655114Z self = 2025-05-07T20:33:01.9656202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9657596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948f1a0>} 2025-05-07T20:33:01.9658955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9659982Z context = 2025-05-07T20:33:01.9660281Z 2025-05-07T20:33:01.9660450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9660984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9661462Z module_map=module_map) 2025-05-07T20:33:01.9661827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9662197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9662468Z E ^ 2025-05-07T20:33:01.9662933Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9663398Z 2025-05-07T20:33:01.9663818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9664342Z 2025-05-07T20:33:01.9664446Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9664878Z self=, 2025-05-07T20:33:01.9665279Z T=1, 2025-05-07T20:33:01.9665732Z D=7168, 2025-05-07T20:33:01.9665934Z scale_ub=1200.0, 2025-05-07T20:33:01.9666159Z contiguous=False, 2025-05-07T20:33:01.9666397Z compiled=False, 2025-05-07T20:33:01.9666611Z ) 2025-05-07T20:33:01.9666933Z self = 2025-05-07T20:33:01.9667439Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.9667712Z 2025-05-07T20:33:01.9667800Z @given( 2025-05-07T20:33:01.9668030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9668357Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9668680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9669018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9669346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9669645Z ) 2025-05-07T20:33:01.9669995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9670435Z def test_silu_mul_quant( 2025-05-07T20:33:01.9670682Z self, 2025-05-07T20:33:01.9670888Z T: int, 2025-05-07T20:33:01.9671087Z D: int, 2025-05-07T20:33:01.9671320Z scale_ub: Optional[float], 2025-05-07T20:33:01.9671603Z contiguous: bool, 2025-05-07T20:33:01.9671854Z compiled: bool, 2025-05-07T20:33:01.9672091Z ) -> None: 2025-05-07T20:33:01.9672319Z torch.manual_seed(2025) 2025-05-07T20:33:01.9672569Z 2025-05-07T20:33:01.9672849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9673415Z 2025-05-07T20:33:01.9673611Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9673906Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9674225Z x = x_sign * x_clamp 2025-05-07T20:33:01.9674537Z x0 = x[:, :D] 2025-05-07T20:33:01.9674755Z x1 = x[:, D:] 2025-05-07T20:33:01.9674968Z 2025-05-07T20:33:01.9675153Z if contiguous: 2025-05-07T20:33:01.9675378Z x0 = x0.contiguous() 2025-05-07T20:33:01.9675642Z x1 = x1.contiguous() 2025-05-07T20:33:01.9675964Z 2025-05-07T20:33:01.9676158Z if scale_ub is not None: 2025-05-07T20:33:01.9676440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9676781Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9677090Z ) 2025-05-07T20:33:01.9677291Z else: 2025-05-07T20:33:01.9677508Z scale_ub_tensor = None 2025-05-07T20:33:01.9677764Z 2025-05-07T20:33:01.9678007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9678321Z op = silu_mul_quant 2025-05-07T20:33:01.9678575Z if compiled: 2025-05-07T20:33:01.9678824Z op = torch.compile(op) 2025-05-07T20:33:01.9679117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9679393Z 2025-05-07T20:33:01.9679592Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.9679754Z 2025-05-07T20:33:01.9679860Z moe/activation_test.py:117: 2025-05-07T20:33:01.9680148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9680480Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.9680763Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9681444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.9682132Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.9682674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.9683354Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.9684018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.9684556Z kernel = self.compile( 2025-05-07T20:33:01.9685094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.9685739Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.9686138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.9686374Z 2025-05-07T20:33:01.9686581Z self = 2025-05-07T20:33:01.9687666Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.9689040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf0680>} 2025-05-07T20:33:01.9690377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.9691403Z context = 2025-05-07T20:33:01.9691695Z 2025-05-07T20:33:01.9691875Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.9692416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.9692884Z module_map=module_map) 2025-05-07T20:33:01.9693387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.9693755Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.9694022Z E ^ 2025-05-07T20:33:01.9694499Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.9694991Z 2025-05-07T20:33:01.9695411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9695922Z 2025-05-07T20:33:02.1463547Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.1464783Z self=, 2025-05-07T20:33:02.1466247Z T=4096, 2025-05-07T20:33:02.1466775Z D=7168, 2025-05-07T20:33:02.1467222Z scale_ub=1200.0, 2025-05-07T20:33:02.1467683Z contiguous=False, 2025-05-07T20:33:02.1468130Z compiled=True, 2025-05-07T20:33:02.1468537Z ) 2025-05-07T20:33:02.1468998Z self = 2025-05-07T20:33:02.1469514Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:02.1469792Z 2025-05-07T20:33:02.1469878Z @given( 2025-05-07T20:33:02.1470112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.1470454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.1470770Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.1471106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.1471437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.1471733Z ) 2025-05-07T20:33:02.1472088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.1472532Z def test_silu_mul_quant( 2025-05-07T20:33:02.1472787Z self, 2025-05-07T20:33:02.1472992Z T: int, 2025-05-07T20:33:02.1473189Z D: int, 2025-05-07T20:33:02.1473421Z scale_ub: Optional[float], 2025-05-07T20:33:02.1473707Z contiguous: bool, 2025-05-07T20:33:02.1473947Z compiled: bool, 2025-05-07T20:33:02.1474183Z ) -> None: 2025-05-07T20:33:02.1474409Z torch.manual_seed(2025) 2025-05-07T20:33:02.1474657Z 2025-05-07T20:33:02.1474935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.1475291Z 2025-05-07T20:33:02.1475491Z x_sign = torch.sign(x) 2025-05-07T20:33:02.1475869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.1476193Z x = x_sign * x_clamp 2025-05-07T20:33:02.1476439Z x0 = x[:, :D] 2025-05-07T20:33:02.1476656Z x1 = x[:, D:] 2025-05-07T20:33:02.1476870Z 2025-05-07T20:33:02.1477063Z if contiguous: 2025-05-07T20:33:02.1477295Z x0 = x0.contiguous() 2025-05-07T20:33:02.1477558Z x1 = x1.contiguous() 2025-05-07T20:33:02.1477804Z 2025-05-07T20:33:02.1477999Z if scale_ub is not None: 2025-05-07T20:33:02.1478284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.1478628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.1478944Z ) 2025-05-07T20:33:02.1479143Z else: 2025-05-07T20:33:02.1479364Z scale_ub_tensor = None 2025-05-07T20:33:02.1479616Z 2025-05-07T20:33:02.1479853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.1480175Z op = silu_mul_quant 2025-05-07T20:33:02.1480431Z if compiled: 2025-05-07T20:33:02.1480680Z op = torch.compile(op) 2025-05-07T20:33:02.1480982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.1481264Z 2025-05-07T20:33:02.1481458Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.1481631Z 2025-05-07T20:33:02.1481734Z moe/activation_test.py:117: 2025-05-07T20:33:02.1482036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.1482370Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.1483125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.1483699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.1484344Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.1485001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.1485695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.1486243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.1486926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.1487600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.1488148Z kernel = self.compile( 2025-05-07T20:33:02.1488707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.1489365Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.1489774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.1490010Z 2025-05-07T20:33:02.1490229Z self = 2025-05-07T20:33:02.1491317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.1492711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf1940>} 2025-05-07T20:33:02.1494068Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.1495101Z context = 2025-05-07T20:33:02.1495391Z 2025-05-07T20:33:02.1495569Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.1496091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.1496568Z module_map=module_map) 2025-05-07T20:33:02.1496943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.1497313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.1497573Z E ^ 2025-05-07T20:33:02.1498044Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.1498497Z 2025-05-07T20:33:02.1498973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.1499491Z 2025-05-07T20:33:02.1499598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.1500021Z self=, 2025-05-07T20:33:02.1500433Z T=128, 2025-05-07T20:33:02.1500627Z D=7168, 2025-05-07T20:33:02.1500821Z scale_ub=1200.0, 2025-05-07T20:33:02.1501050Z contiguous=False, 2025-05-07T20:33:02.1501286Z compiled=True, 2025-05-07T20:33:02.1501492Z ) 2025-05-07T20:33:02.2429792Z self = 2025-05-07T20:33:02.2430541Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:02.2430901Z 2025-05-07T20:33:02.2431020Z @given( 2025-05-07T20:33:02.2431335Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.2431771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.2432143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.2432876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.2433212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.2433510Z ) 2025-05-07T20:33:02.2433860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.2434388Z def test_silu_mul_quant( 2025-05-07T20:33:02.2434638Z self, 2025-05-07T20:33:02.2434843Z T: int, 2025-05-07T20:33:02.2435041Z D: int, 2025-05-07T20:33:02.2435269Z scale_ub: Optional[float], 2025-05-07T20:33:02.2435550Z contiguous: bool, 2025-05-07T20:33:02.2435884Z compiled: bool, 2025-05-07T20:33:02.2436122Z ) -> None: 2025-05-07T20:33:02.2436345Z torch.manual_seed(2025) 2025-05-07T20:33:02.2436592Z 2025-05-07T20:33:02.2436871Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.2437227Z 2025-05-07T20:33:02.2437425Z x_sign = torch.sign(x) 2025-05-07T20:33:02.2437729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.2438059Z x = x_sign * x_clamp 2025-05-07T20:33:02.2438302Z x0 = x[:, :D] 2025-05-07T20:33:02.2438534Z x1 = x[:, D:] 2025-05-07T20:33:02.2438757Z 2025-05-07T20:33:02.2438946Z if contiguous: 2025-05-07T20:33:02.2439187Z x0 = x0.contiguous() 2025-05-07T20:33:02.2439457Z x1 = x1.contiguous() 2025-05-07T20:33:02.2439705Z 2025-05-07T20:33:02.2439905Z if scale_ub is not None: 2025-05-07T20:33:02.2440183Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.2440530Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.2440842Z ) 2025-05-07T20:33:02.2441049Z else: 2025-05-07T20:33:02.2441274Z scale_ub_tensor = None 2025-05-07T20:33:02.2441534Z 2025-05-07T20:33:02.2441779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.2442106Z op = silu_mul_quant 2025-05-07T20:33:02.2442397Z if compiled: 2025-05-07T20:33:02.2442650Z op = torch.compile(op) 2025-05-07T20:33:02.2442944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2443228Z 2025-05-07T20:33:02.2443434Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.2443603Z 2025-05-07T20:33:02.2443705Z moe/activation_test.py:117: 2025-05-07T20:33:02.2444008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2444353Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.2444638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2445204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.2445768Z return fn(*args, **kwargs) 
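Note that compiled=True only adds the torch/_dynamo/eval_frame.py frame seen above; the exception is raised when the Triton kernel is first compiled at launch time, so the eager path fails identically. A minimal repro sketch, assuming the import path that pytest printed in the traceback:

    # Assumed repro; raises triton.compiler.errors.CompilationError on a
    # pre-SM89 GPU, with or without torch.compile.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub tensor omitted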
2025-05-07T20:33:02.2446431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.2447123Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.2447672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.2448362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.2449025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.2449564Z kernel = self.compile( 2025-05-07T20:33:02.2450112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.2450775Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.2451176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2451419Z 2025-05-07T20:33:02.2451630Z self = 2025-05-07T20:33:02.2452808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.2454278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf2700>} 2025-05-07T20:33:02.2455622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.2456659Z context = 2025-05-07T20:33:02.2456955Z 2025-05-07T20:33:02.2457123Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.2457654Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.2458130Z module_map=module_map) 2025-05-07T20:33:02.2458518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.2458881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.2459157Z E ^ 2025-05-07T20:33:02.2459624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.2460086Z 2025-05-07T20:33:02.2460505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.2461018Z 2025-05-07T20:33:02.2461132Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.2461553Z self=, 2025-05-07T20:33:02.2461957Z T=2048, 2025-05-07T20:33:02.2462157Z D=7168, 2025-05-07T20:33:02.2462363Z scale_ub=None, 2025-05-07T20:33:02.2462584Z contiguous=True, 2025-05-07T20:33:02.2463005Z compiled=True, 2025-05-07T20:33:02.2463228Z ) 2025-05-07T20:33:02.2463557Z self = 2025-05-07T20:33:02.2464063Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:02.2464339Z 2025-05-07T20:33:02.2464425Z @given( 2025-05-07T20:33:02.2464659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.2464986Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.2465303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.2465796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.2466124Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.2466415Z ) 2025-05-07T20:33:02.2466769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.2467209Z def test_silu_mul_quant( 2025-05-07T20:33:02.2467461Z self, 2025-05-07T20:33:02.2467663Z T: int, 2025-05-07T20:33:02.2467858Z D: int, 2025-05-07T20:33:02.2468100Z scale_ub: Optional[float], 2025-05-07T20:33:02.2468381Z contiguous: bool, 2025-05-07T20:33:02.2468619Z compiled: bool, 2025-05-07T20:33:02.2468851Z ) -> None: 2025-05-07T20:33:02.2469078Z torch.manual_seed(2025) 2025-05-07T20:33:02.2469319Z 2025-05-07T20:33:02.2469597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.2469949Z 2025-05-07T20:33:02.2470145Z x_sign = torch.sign(x) 2025-05-07T20:33:02.2470459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.2470777Z x = x_sign * x_clamp 2025-05-07T20:33:02.2471021Z x0 = x[:, :D] 2025-05-07T20:33:02.2471248Z x1 = x[:, D:] 2025-05-07T20:33:02.2471468Z 2025-05-07T20:33:02.2471655Z if contiguous: 2025-05-07T20:33:02.2471903Z x0 = x0.contiguous() 2025-05-07T20:33:02.2472174Z x1 = x1.contiguous() 2025-05-07T20:33:02.2472422Z 2025-05-07T20:33:02.2472778Z if scale_ub is not None: 2025-05-07T20:33:02.2473112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.2473460Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.2473836Z ) 2025-05-07T20:33:02.2474040Z else: 2025-05-07T20:33:02.2474262Z scale_ub_tensor = None 2025-05-07T20:33:02.2474520Z 2025-05-07T20:33:02.2474765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.2475098Z op = silu_mul_quant 2025-05-07T20:33:02.2475353Z if compiled: 2025-05-07T20:33:02.2475609Z op = torch.compile(op) 2025-05-07T20:33:02.2484678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2484981Z 2025-05-07T20:33:02.2485186Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.2485363Z 2025-05-07T20:33:02.2485468Z moe/activation_test.py:117: 2025-05-07T20:33:02.2485775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2486134Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.2486422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.2487012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.2487590Z return fn(*args, **kwargs) 
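For readers following along, the test's names (y_fp8, y_scale, scale_ub) suggest that silu_mul_quant computes silu(x0) * x1 and quantizes the result rowwise to fp8 with a per-row scale, optionally clamped by scale_ub. A pure-PyTorch sketch of that contract, offered as an illustration of the assumed semantics and not as FBGEMM's actual kernel:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,                          # [T, D], bfloat16
        x1: torch.Tensor,                          # [T, D], bfloat16
        scale_ub: Optional[torch.Tensor] = None,   # [1], float32
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max                  # per-row dequantization scale
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)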
2025-05-07T20:33:02.2488260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.2488969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.2489524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.2490231Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.2490900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.2491442Z kernel = self.compile( 2025-05-07T20:33:02.2492000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.2492669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.2493076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.2493318Z 2025-05-07T20:33:02.2493529Z self = 2025-05-07T20:33:02.2494630Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.2496024Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66cf37e0>} 2025-05-07T20:33:02.2497379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.2498425Z context = 2025-05-07T20:33:02.2498734Z 2025-05-07T20:33:02.2498936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.2499506Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.2499981Z module_map=module_map) 2025-05-07T20:33:02.2500357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.2500723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.2500986Z E ^ 2025-05-07T20:33:02.2501459Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.2501921Z 2025-05-07T20:33:02.2502450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.2503001Z 2025-05-07T20:33:02.3144929Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3146120Z self=, 2025-05-07T20:33:02.3147258Z T=16384, 2025-05-07T20:33:02.3147661Z D=5120, 2025-05-07T20:33:02.3148068Z scale_ub=None, 2025-05-07T20:33:02.3148517Z contiguous=False, 2025-05-07T20:33:02.3148868Z compiled=False, 2025-05-07T20:33:02.3149094Z ) 2025-05-07T20:33:02.3149430Z self = 2025-05-07T20:33:02.3149937Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.3150229Z 2025-05-07T20:33:02.3150311Z @given( 2025-05-07T20:33:02.3150556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3150881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3151201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3151549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3151894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3152187Z ) 2025-05-07T20:33:02.3152549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3153011Z def test_silu_mul_quant( 2025-05-07T20:33:02.3153260Z self, 2025-05-07T20:33:02.3153472Z T: int, 2025-05-07T20:33:02.3153685Z D: int, 2025-05-07T20:33:02.3153910Z scale_ub: Optional[float], 2025-05-07T20:33:02.3154194Z contiguous: bool, 2025-05-07T20:33:02.3154451Z compiled: bool, 2025-05-07T20:33:02.3154690Z ) -> None: 2025-05-07T20:33:02.3154911Z torch.manual_seed(2025) 2025-05-07T20:33:02.3155166Z 2025-05-07T20:33:02.3155462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3155938Z 2025-05-07T20:33:02.3156154Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3156461Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3158498Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3160450Z 2025-05-07T20:33:02.3160583Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.3160801Z 2025-05-07T20:33:02.3160911Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3161341Z self=, 2025-05-07T20:33:02.3161768Z T=4096, 2025-05-07T20:33:02.3161974Z D=7168, 2025-05-07T20:33:02.3162173Z scale_ub=1200.0, 2025-05-07T20:33:02.3162410Z contiguous=True, 2025-05-07T20:33:02.3162646Z compiled=True, 2025-05-07T20:33:02.3162858Z ) 2025-05-07T20:33:02.3163196Z self = 2025-05-07T20:33:02.3163700Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.3163976Z 2025-05-07T20:33:02.3164060Z @given( 2025-05-07T20:33:02.3164304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3164629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3164947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3165292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3165946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3166241Z ) 2025-05-07T20:33:02.3166764Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3167338Z def test_silu_mul_quant( 2025-05-07T20:33:02.3167593Z self, 2025-05-07T20:33:02.3167797Z T: int, 2025-05-07T20:33:02.3168067Z D: int, 2025-05-07T20:33:02.3168298Z scale_ub: Optional[float], 2025-05-07T20:33:02.3168573Z contiguous: bool, 2025-05-07T20:33:02.3168838Z compiled: bool, 2025-05-07T20:33:02.3169117Z ) -> None: 2025-05-07T20:33:02.3169338Z torch.manual_seed(2025) 2025-05-07T20:33:02.3169600Z 2025-05-07T20:33:02.3169883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3170232Z 2025-05-07T20:33:02.3170440Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3170740Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3172774Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3174656Z 2025-05-07T20:33:02.3174787Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.3175007Z 2025-05-07T20:33:02.3175116Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3175540Z self=, 2025-05-07T20:33:02.3175960Z T=16384, 2025-05-07T20:33:02.3176165Z D=7168, 2025-05-07T20:33:02.3176370Z scale_ub=None, 2025-05-07T20:33:02.3176605Z contiguous=False, 2025-05-07T20:33:02.3176836Z compiled=False, 2025-05-07T20:33:02.3177053Z ) 2025-05-07T20:33:02.3177394Z self = 2025-05-07T20:33:02.3177907Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.3178194Z 2025-05-07T20:33:02.3178280Z @given( 2025-05-07T20:33:02.3178525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3178884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3179225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3179569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3179915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3180213Z ) 2025-05-07T20:33:02.3180576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3181020Z def test_silu_mul_quant( 2025-05-07T20:33:02.3181273Z self, 2025-05-07T20:33:02.3181481Z T: int, 2025-05-07T20:33:02.3181683Z D: int, 2025-05-07T20:33:02.3181917Z scale_ub: Optional[float], 2025-05-07T20:33:02.3182199Z contiguous: bool, 2025-05-07T20:33:02.3182452Z compiled: bool, 2025-05-07T20:33:02.3182680Z ) -> None: 2025-05-07T20:33:02.3182912Z torch.manual_seed(2025) 2025-05-07T20:33:02.3183171Z 2025-05-07T20:33:02.3183450Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3185529Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3187416Z 2025-05-07T20:33:02.3187661Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:02.3187878Z 2025-05-07T20:33:02.3187991Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3188420Z self=, 2025-05-07T20:33:02.3188868Z T=2048, 2025-05-07T20:33:02.3189068Z D=7168, 2025-05-07T20:33:02.3189275Z scale_ub=1200.0, 2025-05-07T20:33:02.3189502Z contiguous=True, 2025-05-07T20:33:02.3189738Z compiled=True, 2025-05-07T20:33:02.3189948Z ) 2025-05-07T20:33:02.3190269Z self = 2025-05-07T20:33:02.3190770Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.3191044Z 2025-05-07T20:33:02.3191133Z @given( 2025-05-07T20:33:02.3191364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3191686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3192003Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3192346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3192687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3192983Z ) 2025-05-07T20:33:02.3193346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3193790Z def test_silu_mul_quant( 2025-05-07T20:33:02.3194041Z self, 2025-05-07T20:33:02.3194246Z T: int, 2025-05-07T20:33:02.3194447Z D: int, 2025-05-07T20:33:02.3194676Z scale_ub: Optional[float], 2025-05-07T20:33:02.3194955Z contiguous: bool, 2025-05-07T20:33:02.3195197Z compiled: bool, 2025-05-07T20:33:02.3195429Z ) -> None: 2025-05-07T20:33:02.3195653Z torch.manual_seed(2025) 2025-05-07T20:33:02.3195965Z 2025-05-07T20:33:02.3196251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3196602Z 2025-05-07T20:33:02.3196800Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3197107Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3199166Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.3201030Z 2025-05-07T20:33:02.3201151Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.3201370Z 2025-05-07T20:33:02.3201492Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3201909Z self=, 2025-05-07T20:33:02.3202322Z T=2048, 2025-05-07T20:33:02.3202526Z D=7168, 2025-05-07T20:33:02.3202722Z scale_ub=None, 2025-05-07T20:33:02.3202945Z contiguous=True, 2025-05-07T20:33:02.3203179Z compiled=False, 2025-05-07T20:33:02.3203388Z ) 2025-05-07T20:33:02.4351187Z self = 2025-05-07T20:33:02.4351730Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:02.4352076Z 2025-05-07T20:33:02.4352192Z @given( 2025-05-07T20:33:02.4352503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4352816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4353130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4353464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4353792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4354082Z ) 2025-05-07T20:33:02.4354745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4355279Z def test_silu_mul_quant( 2025-05-07T20:33:02.4355521Z self, 2025-05-07T20:33:02.4355720Z T: int, 2025-05-07T20:33:02.4356012Z D: int, 2025-05-07T20:33:02.4356312Z scale_ub: Optional[float], 2025-05-07T20:33:02.4356589Z contiguous: bool, 2025-05-07T20:33:02.4356835Z compiled: bool, 2025-05-07T20:33:02.4357054Z ) -> None: 2025-05-07T20:33:02.4357275Z torch.manual_seed(2025) 2025-05-07T20:33:02.4357522Z 2025-05-07T20:33:02.4357790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4358138Z 2025-05-07T20:33:02.4358339Z > x_sign = torch.sign(x) 2025-05-07T20:33:02.4360298Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
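The OutOfMemoryError examples are a secondary symptom rather than a new bug: hypothesis keeps allocating large bfloat16 inputs in the same process (T=16384 with D=7168 alone is a 448 MiB tensor, matching the failed allocation above), and after enough failed examples the ~22 GiB card has almost nothing free, so even torch.randn or torch.sign cannot allocate. Besides the allocator hint the error message itself prints (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), releasing cached blocks between examples is a plausible mitigation; a sketch, assuming it is called at the top of the test:

    import gc
    import torch

    def _free_cuda() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # return cached blocks to the driver
        torch.cuda.synchronize()

    # Or launch the suite as:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py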
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.4362164Z 2025-05-07T20:33:02.4362289Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:02.4362499Z 2025-05-07T20:33:02.4362601Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.4363014Z self=, 2025-05-07T20:33:02.4363421Z T=1, 2025-05-07T20:33:02.4363603Z D=7168, 2025-05-07T20:33:02.4363797Z scale_ub=1200.0, 2025-05-07T20:33:02.4364021Z contiguous=True, 2025-05-07T20:33:02.4364239Z compiled=False, 2025-05-07T20:33:02.4364446Z ) 2025-05-07T20:33:02.4364765Z self = 2025-05-07T20:33:02.4365247Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:02.4365791Z 2025-05-07T20:33:02.4365872Z @given( 2025-05-07T20:33:02.4366110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4366431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4366734Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4367070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4367403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4367689Z ) 2025-05-07T20:33:02.4368039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4368483Z def test_silu_mul_quant( 2025-05-07T20:33:02.4368722Z self, 2025-05-07T20:33:02.4368921Z T: int, 2025-05-07T20:33:02.4369125Z D: int, 2025-05-07T20:33:02.4369341Z scale_ub: Optional[float], 2025-05-07T20:33:02.4369620Z contiguous: bool, 2025-05-07T20:33:02.4369865Z compiled: bool, 2025-05-07T20:33:02.4370091Z ) -> None: 2025-05-07T20:33:02.4370314Z torch.manual_seed(2025) 2025-05-07T20:33:02.4370559Z 2025-05-07T20:33:02.4370832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4371172Z 2025-05-07T20:33:02.4371368Z x_sign = torch.sign(x) 2025-05-07T20:33:02.4371657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.4371965Z x = x_sign * x_clamp 2025-05-07T20:33:02.4372210Z x0 = x[:, :D] 2025-05-07T20:33:02.4372428Z x1 = x[:, D:] 2025-05-07T20:33:02.4372637Z 2025-05-07T20:33:02.4372829Z if contiguous: 2025-05-07T20:33:02.4373061Z x0 = x0.contiguous() 2025-05-07T20:33:02.4373322Z x1 = x1.contiguous() 2025-05-07T20:33:02.4373567Z 2025-05-07T20:33:02.4373764Z if scale_ub is not None: 2025-05-07T20:33:02.4374037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.4374514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.4374883Z ) 2025-05-07T20:33:02.4375074Z else: 2025-05-07T20:33:02.4375302Z scale_ub_tensor = None 2025-05-07T20:33:02.4375555Z 2025-05-07T20:33:02.4375851Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.4376164Z op = silu_mul_quant 2025-05-07T20:33:02.4376416Z if compiled: 2025-05-07T20:33:02.4376667Z op = torch.compile(op) 2025-05-07T20:33:02.4376960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4377241Z 2025-05-07T20:33:02.4377440Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.4377609Z 2025-05-07T20:33:02.4377720Z moe/activation_test.py:117: 2025-05-07T20:33:02.4378012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4378349Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.4378639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4379333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.4380040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.4380583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.4381273Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.4381929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.4382466Z kernel = self.compile( 2025-05-07T20:33:02.4383014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.4383662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.4384065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4384301Z 2025-05-07T20:33:02.4384518Z self = 2025-05-07T20:33:02.4385605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.4386977Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a692dab60>} 2025-05-07T20:33:02.4388328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.4389361Z context = 2025-05-07T20:33:02.4389650Z 2025-05-07T20:33:02.4389833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.4390365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.4390831Z module_map=module_map) 2025-05-07T20:33:02.4391204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.4391565Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.4391822Z E ^ 2025-05-07T20:33:02.4392289Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.4393161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:02.4393783Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
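The frames above show how silu_mul_quant reaches the failure: activation.py:80 launches a @triton.jit kernel with the subscript syntax _fbgemm_silu_mul_quant[grid](...), jit.py forwards that call into JITFunction.run, and the first launch triggers compile -> make_ir, which is where the dtype check fires. A minimal, self-contained sketch of that launch pattern (a generic elementwise kernel, not the FBGEMM one):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # One program instance per BLOCK-sized slice of the input.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2.0, mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    # kernel[grid](...) is the launch syntax seen at activation.py:80; the body
    # is compiled to device code on first launch, where dtype checks happen.
    _double_kernel[grid](x, y, x.numel(), BLOCK=1024)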
2025-05-07T20:33:02.5115702Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
2025-05-07T20:33:02.5157069Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:02.5984384Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:02.5986462Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
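fp8e4nv is Triton's name for the float8 e4m3 format behind torch.float8_e4m3fn; Triton only accepts it on GPUs with compute capability 8.9 or newer (Ada/Hopper), while e.g. an A10G reports (8, 6) and offers only fp8e4b15 and fp8e5, which matches the ValueError above. A minimal sketch of a capability gate such a test could use (the helper and decorator are hypothetical, not part of the FBGEMM test):

    import unittest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs compute capability
        # >= 8.9; an A10G reports (8, 6), so only fp8e4b15/fp8e5 exist there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical gate -- the FBGEMM test does not carry this decorator.
    @unittest.skipUnless(fp8e4nv_supported(), "FP8 e4m3 unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...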
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:02.5988468Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:02.5988801Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
2025-05-07T20:33:02.6020642Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:02.6029397Z >       x_sign = torch.sign(x)
  -> torch.OutOfMemoryError at moe/activation_test.py:94: tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated).
2025-05-07T20:33:02.6033679Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:02.6808471Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 320.00 MiB (26.44 MiB free).
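The allocator message names its own mitigation: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. It has to be in place before CUDA is initialized, so in-process it must precede the first CUDA call (most simply, before importing torch), or be set in the job's environment. Note that in this run only ~26 MiB of 22.07 GiB is actually free, so reducing fragmentation alone would likely not rescue these examples. A minimal sketch:

    import os

    # Must be set before torch initializes CUDA -- simplest is before the
    # import (or in the CI job env); once the allocator is live it is ignored.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402

    x = torch.randn(1024, device="cuda")  # allocator now uses expandable segments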
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 80.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB (26.44 MiB free).
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB (26.44 MiB free).
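The requested sizes line up with the test's input shape: x is [T, 2*D] in bfloat16 (2 bytes per element), so the torch.randn at line 92 asks for 4*T*D bytes; the follow-up lines 94-95 (torch.sign, torch.abs/torch.clamp) each materialize another tensor of the same size, which is why some examples fail there instead of at the initial allocation. A quick check against the logged numbers:

    def randn_alloc_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element.
        return T * (2 * D) * 2 / 2**20

    assert randn_alloc_mib(2048, 7168) == 56.0    # "Tried to allocate 56.00 MiB"
    assert randn_alloc_mib(4096, 5120) == 80.0    # "Tried to allocate 80.00 MiB"
    assert randn_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"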
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same test source and same Triton traceback as above; CompilationError in _fbgemm_silu_mul_quant ("type fp8e4nv not supported in this architecture").
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 56.00 MiB (26.44 MiB free; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated).
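The per-example dumps flooding this log come from the test's @settings(verbosity=Verbosity.verbose, ...); at the default verbosity Hypothesis prints only the final minimal failures. A small sketch of the quieter configuration (max_examples=10 is an arbitrary stand-in for the test's _MAX_SAMPLES):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    @settings(verbosity=Verbosity.normal, max_examples=10, deadline=None)
    def test_quiet(T: int) -> None:
        # At Verbosity.normal the "Trying example: ..." lines are omitted;
        # only minimal reproducing examples are reported on failure.
        assert T >= 1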
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
  -> same CompilationError as above ("type fp8e4nv not supported in this architecture"); the compiled=True path goes through torch/_dynamo/eval_frame.py but reaches the same Triton kernel.
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 20.00 MiB (4.44 MiB free; 21.77 GiB allocated by PyTorch, 6.37 MiB reserved but unallocated).
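Free memory shrinks as the run proceeds (26.44 MiB free in the earlier examples, 4.44 MiB by this point), i.e. allocations from failed examples accumulate across Hypothesis examples. A hedged cleanup sketch that a test could run between examples; it only reclaims memory from tensors that are no longer referenced:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()                 # drop unreferenced Python-side tensors first
        torch.cuda.empty_cache()     # return cached, unused blocks to the driver
        torch.cuda.synchronize()     # make sure pending frees have completed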
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.9718690Z 2025-05-07T20:33:02.9718810Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.9719024Z 2025-05-07T20:33:02.9719134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.9719587Z self=, 2025-05-07T20:33:02.9719990Z T=128, 2025-05-07T20:33:02.9720173Z D=5120, 2025-05-07T20:33:02.9720358Z scale_ub=1200.0, 2025-05-07T20:33:02.9720576Z contiguous=True, 2025-05-07T20:33:02.9720799Z compiled=True, 2025-05-07T20:33:02.9721010Z ) 2025-05-07T20:33:02.9721344Z self = 2025-05-07T20:33:02.9722002Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.9722329Z 2025-05-07T20:33:02.9722412Z @given( 2025-05-07T20:33:02.9722652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.9723020Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.9723329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.9723667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.9723999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.9724278Z ) 2025-05-07T20:33:02.9724628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.9725087Z def test_silu_mul_quant( 2025-05-07T20:33:02.9725331Z self, 2025-05-07T20:33:02.9725537Z T: int, 2025-05-07T20:33:02.9725744Z D: int, 2025-05-07T20:33:02.9725963Z scale_ub: Optional[float], 2025-05-07T20:33:02.9726243Z contiguous: bool, 2025-05-07T20:33:02.9726504Z compiled: bool, 2025-05-07T20:33:02.9726737Z ) -> None: 2025-05-07T20:33:02.9726955Z torch.manual_seed(2025) 2025-05-07T20:33:02.9727206Z 2025-05-07T20:33:02.9727487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.9727833Z 2025-05-07T20:33:02.9728035Z x_sign = torch.sign(x) 2025-05-07T20:33:02.9728331Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.9730384Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:02.9732251Z 2025-05-07T20:33:02.9732375Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:02.9732598Z 2025-05-07T20:33:02.9732705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.9733127Z self=, 2025-05-07T20:33:02.9733536Z T=128, 2025-05-07T20:33:02.9733725Z D=7168, 2025-05-07T20:33:02.9733925Z scale_ub=None, 2025-05-07T20:33:02.9734146Z contiguous=True, 2025-05-07T20:33:02.9734368Z compiled=True, 2025-05-07T20:33:02.9734577Z ) 2025-05-07T20:33:03.4556922Z self = 2025-05-07T20:33:03.4557654Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.4557928Z 2025-05-07T20:33:03.4558011Z @given( 2025-05-07T20:33:03.4558256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4558579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4558925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4559255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4559637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4559933Z ) 2025-05-07T20:33:03.4560279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4560725Z def test_silu_mul_quant( 2025-05-07T20:33:03.4560974Z self, 2025-05-07T20:33:03.4561174Z T: int, 2025-05-07T20:33:03.4561380Z D: int, 2025-05-07T20:33:03.4561608Z scale_ub: Optional[float], 2025-05-07T20:33:03.4561880Z contiguous: bool, 2025-05-07T20:33:03.4562127Z compiled: bool, 2025-05-07T20:33:03.4562359Z ) -> None: 2025-05-07T20:33:03.4562574Z torch.manual_seed(2025) 2025-05-07T20:33:03.4562821Z 2025-05-07T20:33:03.4563098Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4565807Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
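The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. A sketch of applying it from Python rather than the shell; the env var is the documented one, the placement is ours:

    import os

    # The allocator reads this variable once, at the first CUDA allocation,
    # so it must be in place before any tensor touches the GPU.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the env var is set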
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4567862Z 2025-05-07T20:33:03.4567984Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:03.4568196Z 2025-05-07T20:33:03.4601789Z FAILED 2025-05-07T20:33:03.4601967Z 2025-05-07T20:33:03.4602405Z =================================== FAILURES =================================== 2025-05-07T20:33:03.4603040Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:03.4603686Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:03.4604530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:03.4605304Z | yield 2025-05-07T20:33:03.4605908Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:03.4606625Z | self._callTestMethod(testMethod) 2025-05-07T20:33:03.4607390Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:03.4608158Z | if method() is not None: 2025-05-07T20:33:03.4608518Z | ^^^^^^^^ 2025-05-07T20:33:03.4609406Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:03.4610431Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4610853Z | ^^^^^^^ 2025-05-07T20:33:03.4611622Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:03.4612478Z | raise the_error_hypothesis_found 2025-05-07T20:33:03.4613059Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:03.4613638Z +-+---------------- 1 ---------------- 2025-05-07T20:33:03.4614038Z | Traceback (most recent call last): 2025-05-07T20:33:03.4615013Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:03.4616095Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4616611Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4619787Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
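The runner reports the four sub-failures as a single ExceptionGroup, which is how Hypothesis surfaces multiple distinct falsifying examples under Python 3.12's unittest. A self-contained sketch of splitting such a group with except*; the exceptions here are stand-ins, not FBGEMM code:

    # Illustrative only: splitting a multi-failure ExceptionGroup by type.
    def demo() -> None:
        raise ExceptionGroup(
            "Hypothesis found 4 distinct failures.",
            [MemoryError("CUDA out of memory"),
             ValueError("type fp8e4nv not supported in this architecture")],
        )

    try:
        demo()
    except* MemoryError as group:
        print(f"{len(group.exceptions)} OOM-style failure(s)")
    except* ValueError as group:
        print(f"{len(group.exceptions)} compile-style failure(s)")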
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4622536Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4623135Z | self=, 2025-05-07T20:33:03.4623692Z | T=2048, 2025-05-07T20:33:03.4624014Z | D=5120, # or any other generated value 2025-05-07T20:33:03.4624478Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:03.4624968Z | contiguous=True, # or any other generated value 2025-05-07T20:33:03.4625479Z | compiled=False, # or any other generated value 2025-05-07T20:33:03.4626145Z | ) 2025-05-07T20:33:03.4626404Z | 2025-05-07T20:33:03.4627118Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:03.4628008Z +---------------- 2 ---------------- 2025-05-07T20:33:03.4628406Z | Traceback (most recent call last): 2025-05-07T20:33:03.4629378Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:03.4630493Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4631003Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4633649Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4635639Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4636155Z | self=, 2025-05-07T20:33:03.4636569Z | T=128, 2025-05-07T20:33:03.4636776Z | D=7168, 2025-05-07T20:33:03.4636982Z | scale_ub=None, 2025-05-07T20:33:03.4637225Z | contiguous=True, 2025-05-07T20:33:03.4637475Z | compiled=True, 2025-05-07T20:33:03.4637705Z | ) 2025-05-07T20:33:03.4637885Z | 2025-05-07T20:33:03.4638420Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:03.4639029Z +---------------- 3 ---------------- 2025-05-07T20:33:03.4639316Z | Traceback (most recent call last): 2025-05-07T20:33:03.4640022Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:03.4640805Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4641183Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4643162Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
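Each falsifying example above comes with a replay handle. As the log says, pinning the test to one example is a temporary one-line change; the sketch below copies the blob verbatim from failure 1 and trims the strategies to a single parameter for brevity:

    from hypothesis import given, reproduce_failure, strategies as st

    # The blob is only meaningful for the Hypothesis version that
    # emitted it (6.131.14); remove the decorator after debugging.
    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_silu_mul_quant(T: int) -> None:
        ...  # existing test body, unchanged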
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.4645127Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4645573Z | self=, 2025-05-07T20:33:03.4645982Z | T=128, 2025-05-07T20:33:03.4646183Z | D=5120, 2025-05-07T20:33:03.4646396Z | scale_ub=1200.0, 2025-05-07T20:33:03.4646639Z | contiguous=True, 2025-05-07T20:33:03.4646877Z | compiled=True, 2025-05-07T20:33:03.4647103Z | ) 2025-05-07T20:33:03.4647288Z | 2025-05-07T20:33:03.4647811Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:03.4648413Z +---------------- 4 ---------------- 2025-05-07T20:33:03.4648703Z | Traceback (most recent call last): 2025-05-07T20:33:03.4649504Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:03.4650246Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:03.4650577Z | ^^^^^^^^ 2025-05-07T20:33:03.4651215Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:03.4651911Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4652247Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4653042Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:03.4653836Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4654443Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:03.4655179Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4655625Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4656268Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:03.4657036Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4657513Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4658152Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:03.4658878Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4659399Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4660263Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:03.4661035Z | fn() 2025-05-07T20:33:03.4661801Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:03.4662656Z | self.fn.run( 2025-05-07T20:33:03.4663380Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:03.4664185Z | kernel = self.compile( 2025-05-07T20:33:03.4664545Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:03.4665776Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:03.4666754Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4667302Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4668173Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:03.4669258Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4669924Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:03.4670453Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4670942Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4671327Z | ^ 2025-05-07T20:33:03.4671979Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4672777Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:03.4673341Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:03.4674354Z | self=, 2025-05-07T20:33:03.4674956Z | T=1, # or any other generated value 2025-05-07T20:33:03.4675473Z | D=5120, # or any other generated value 2025-05-07T20:33:03.4676070Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:03.4676577Z | contiguous=True, # or any other generated value 2025-05-07T20:33:03.4677078Z | compiled=True, # or any other generated value 2025-05-07T20:33:03.4677502Z | ) 2025-05-07T20:33:03.4677764Z | 2025-05-07T20:33:03.4678480Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:03.4679322Z +------------------------------------ 2025-05-07T20:33:03.4679820Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:03.4680346Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4680928Z self=, 2025-05-07T20:33:03.4681486Z T=1, 2025-05-07T20:33:03.4681754Z D=5120, 2025-05-07T20:33:03.4682023Z scale_ub=None, 2025-05-07T20:33:03.4682324Z contiguous=True, 2025-05-07T20:33:03.4682639Z compiled=True, 2025-05-07T20:33:03.4682924Z ) 2025-05-07T20:33:03.4683364Z self = 2025-05-07T20:33:03.4684037Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.4684404Z 2025-05-07T20:33:03.4684517Z @given( 2025-05-07T20:33:03.4684841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4685284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4685705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4686166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4686633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4687049Z ) 2025-05-07T20:33:03.4687531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4688151Z def test_silu_mul_quant( 2025-05-07T20:33:03.4688499Z self, 2025-05-07T20:33:03.4688762Z T: int, 2025-05-07T20:33:03.4689037Z D: int, 2025-05-07T20:33:03.4689355Z scale_ub: Optional[float], 2025-05-07T20:33:03.4689769Z contiguous: bool, 2025-05-07T20:33:03.4690101Z compiled: bool, 2025-05-07T20:33:03.4690405Z ) -> None: 2025-05-07T20:33:03.4690699Z torch.manual_seed(2025) 2025-05-07T20:33:03.4691042Z 2025-05-07T20:33:03.4691425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4691895Z 2025-05-07T20:33:03.4692177Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4692582Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4693023Z x = x_sign * x_clamp 2025-05-07T20:33:03.4693368Z x0 = x[:, :D] 2025-05-07T20:33:03.4693690Z x1 = x[:, D:] 2025-05-07T20:33:03.4694001Z 2025-05-07T20:33:03.4694262Z if contiguous: 2025-05-07T20:33:03.4694594Z x0 = x0.contiguous() 2025-05-07T20:33:03.4694970Z x1 = x1.contiguous() 2025-05-07T20:33:03.4695315Z 2025-05-07T20:33:03.4695594Z if scale_ub is not None: 2025-05-07T20:33:03.4695973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4696431Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4696867Z ) 2025-05-07T20:33:03.4697143Z else: 2025-05-07T20:33:03.4697438Z scale_ub_tensor = None 2025-05-07T20:33:03.4697804Z 2025-05-07T20:33:03.4698132Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4698572Z op = silu_mul_quant 2025-05-07T20:33:03.4698916Z if compiled: 2025-05-07T20:33:03.4699262Z op = torch.compile(op) 2025-05-07T20:33:03.4699780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4700209Z 2025-05-07T20:33:03.4700490Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4700874Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4701340Z 2025-05-07T20:33:03.4701673Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4702137Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4702545Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4702987Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4703473Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4703896Z 2025-05-07T20:33:03.4704178Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:03.4704445Z 2025-05-07T20:33:03.4704590Z moe/activation_test.py:126: 2025-05-07T20:33:03.4705002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4705481Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4705948Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4707031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4708056Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4708790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4709689Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4710595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4711557Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4712523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4713369Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4714159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4714852Z fn() 2025-05-07T20:33:03.4715530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4716433Z self.fn.run( 2025-05-07T20:33:03.4717078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4717825Z kernel = self.compile( 2025-05-07T20:33:03.4718546Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4719446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4720015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4720340Z 2025-05-07T20:33:03.4720629Z self = 2025-05-07T20:33:03.4722064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4723885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91f50c20>} 2025-05-07T20:33:03.4725646Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4726992Z context = 2025-05-07T20:33:03.4727370Z 2025-05-07T20:33:03.4727700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4728466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4729080Z module_map=module_map) 2025-05-07T20:33:03.4729607Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4730077Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4730427Z E ^ 2025-05-07T20:33:03.4731055Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4731645Z 2025-05-07T20:33:03.4732199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4732865Z 2025-05-07T20:33:03.4733014Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4733552Z self=, 2025-05-07T20:33:03.4734084Z T=2048, 2025-05-07T20:33:03.4734352Z D=5120, 2025-05-07T20:33:03.4734606Z scale_ub=1200.0, 2025-05-07T20:33:03.4734900Z contiguous=True, 2025-05-07T20:33:03.4735195Z compiled=False, 2025-05-07T20:33:03.4735465Z ) 2025-05-07T20:33:03.4735922Z self = 2025-05-07T20:33:03.4736601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.4736974Z 2025-05-07T20:33:03.4737089Z @given( 2025-05-07T20:33:03.4737400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4737835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4738274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4738741Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4739209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4739603Z ) 2025-05-07T20:33:03.4740080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4740701Z def test_silu_mul_quant( 2025-05-07T20:33:03.4741032Z self, 2025-05-07T20:33:03.4741294Z T: int, 2025-05-07T20:33:03.4741585Z D: int, 2025-05-07T20:33:03.4763039Z scale_ub: Optional[float], 2025-05-07T20:33:03.4763427Z contiguous: bool, 2025-05-07T20:33:03.4763759Z compiled: bool, 2025-05-07T20:33:03.4764070Z ) -> None: 2025-05-07T20:33:03.4764351Z torch.manual_seed(2025) 2025-05-07T20:33:03.4764681Z 2025-05-07T20:33:03.4765055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4765769Z 2025-05-07T20:33:03.4766034Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4766429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4766845Z x = x_sign * x_clamp 2025-05-07T20:33:03.4767177Z x0 = x[:, :D] 
2025-05-07T20:33:03.4767479Z x1 = x[:, D:] 2025-05-07T20:33:03.4767752Z 2025-05-07T20:33:03.4768014Z if contiguous: 2025-05-07T20:33:03.4768336Z x0 = x0.contiguous() 2025-05-07T20:33:03.4768676Z x1 = x1.contiguous() 2025-05-07T20:33:03.4768995Z 2025-05-07T20:33:03.4769255Z if scale_ub is not None: 2025-05-07T20:33:03.4769619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4770074Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4770506Z ) 2025-05-07T20:33:03.4770767Z else: 2025-05-07T20:33:03.4771061Z scale_ub_tensor = None 2025-05-07T20:33:03.4771404Z 2025-05-07T20:33:03.4771710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4772131Z op = silu_mul_quant 2025-05-07T20:33:03.4772477Z if compiled: 2025-05-07T20:33:03.4772811Z op = torch.compile(op) 2025-05-07T20:33:03.4773208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4773576Z 2025-05-07T20:33:03.4773830Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4774421Z 2025-05-07T20:33:03.4774565Z moe/activation_test.py:117: 2025-05-07T20:33:03.4774974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4775423Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4775877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4776785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.4777723Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4778464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4779412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4780331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4781078Z kernel = self.compile( 2025-05-07T20:33:03.4781831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4782754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4783322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4783626Z 2025-05-07T20:33:03.4783895Z self = 2025-05-07T20:33:03.4785286Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4787083Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b91e10180>} 2025-05-07T20:33:03.4788825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4790203Z context = 2025-05-07T20:33:03.4790574Z 2025-05-07T20:33:03.4790788Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4791465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4792064Z module_map=module_map) 2025-05-07T20:33:03.4792529Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4792984Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4793343Z E ^ 2025-05-07T20:33:03.4793948Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4794546Z 2025-05-07T20:33:03.4795101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4795884Z 2025-05-07T20:33:03.4796022Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4796557Z self=, 2025-05-07T20:33:03.4797093Z T=2048, 2025-05-07T20:33:03.4797339Z D=5120, 2025-05-07T20:33:03.4797597Z scale_ub=1200.0, 2025-05-07T20:33:03.4797896Z contiguous=True, 2025-05-07T20:33:03.4798215Z compiled=True, 2025-05-07T20:33:03.4798521Z ) 2025-05-07T20:33:03.4798950Z self = 2025-05-07T20:33:03.4799601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:03.4799978Z 2025-05-07T20:33:03.4800093Z @given( 2025-05-07T20:33:03.4800424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4800861Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4801379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4801874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4802344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4802772Z ) 2025-05-07T20:33:03.4803249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4803853Z def test_silu_mul_quant( 2025-05-07T20:33:03.4804176Z self, 2025-05-07T20:33:03.4804455Z T: int, 2025-05-07T20:33:03.4804726Z D: int, 2025-05-07T20:33:03.4805022Z scale_ub: Optional[float], 2025-05-07T20:33:03.4805390Z contiguous: bool, 2025-05-07T20:33:03.4805723Z compiled: bool, 2025-05-07T20:33:03.4806037Z ) -> None: 2025-05-07T20:33:03.4806329Z torch.manual_seed(2025) 2025-05-07T20:33:03.4806660Z 2025-05-07T20:33:03.4807035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4807508Z 2025-05-07T20:33:03.4807791Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4808197Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4808605Z x = x_sign * x_clamp 2025-05-07T20:33:03.4808926Z x0 = x[:, :D] 2025-05-07T20:33:03.4809236Z x1 = x[:, D:] 2025-05-07T20:33:03.4809521Z 2025-05-07T20:33:03.4809772Z if contiguous: 2025-05-07T20:33:03.4810084Z x0 = x0.contiguous() 2025-05-07T20:33:03.4810440Z x1 = x1.contiguous() 2025-05-07T20:33:03.4810775Z 2025-05-07T20:33:03.4811040Z if scale_ub is not None: 2025-05-07T20:33:03.4811418Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4811875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4812298Z ) 2025-05-07T20:33:03.4812556Z else: 2025-05-07T20:33:03.4812846Z scale_ub_tensor = None 2025-05-07T20:33:03.4813190Z 2025-05-07T20:33:03.4813509Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4813954Z op = silu_mul_quant 2025-05-07T20:33:03.4814297Z if compiled: 2025-05-07T20:33:03.4814628Z op = torch.compile(op) 2025-05-07T20:33:03.4815031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4815395Z 2025-05-07T20:33:03.4815664Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4816060Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4816462Z 2025-05-07T20:33:03.4816795Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4817258Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4817670Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4818103Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4818608Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4819041Z 2025-05-07T20:33:03.4819330Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:03.4819609Z 2025-05-07T20:33:03.4819765Z moe/activation_test.py:126: 2025-05-07T20:33:03.4820165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4820628Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4821089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4822132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4823123Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4823851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4824773Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4825684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4826762Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4827786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4828642Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4829502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4830207Z fn() 2025-05-07T20:33:03.4830881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4831665Z self.fn.run( 2025-05-07T20:33:03.4832297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4833023Z kernel = self.compile( 2025-05-07T20:33:03.4833769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4834676Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4835210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4835512Z 2025-05-07T20:33:03.4835826Z self = 2025-05-07T20:33:03.4836907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4838278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b906de840>} 2025-05-07T20:33:03.4839615Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4840643Z context = 2025-05-07T20:33:03.4840934Z 2025-05-07T20:33:03.4841099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4841623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4842092Z module_map=module_map) 2025-05-07T20:33:03.4842451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4842806Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4843077Z E ^ 2025-05-07T20:33:03.4843541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4843991Z 2025-05-07T20:33:03.4844404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4844919Z 2025-05-07T20:33:03.4845029Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4845450Z self=, 2025-05-07T20:33:03.4845851Z T=16384, 2025-05-07T20:33:03.4846055Z D=7168, 2025-05-07T20:33:03.4846257Z scale_ub=1200.0, 2025-05-07T20:33:03.4846482Z contiguous=False, 2025-05-07T20:33:03.4846700Z compiled=False, 2025-05-07T20:33:03.4846910Z ) 2025-05-07T20:33:03.4847230Z self = 2025-05-07T20:33:03.4847728Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:03.4848010Z 2025-05-07T20:33:03.4848092Z @given( 2025-05-07T20:33:03.4848324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4848634Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4848941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4849277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4849779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4850071Z ) 2025-05-07T20:33:03.4850420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4850903Z def test_silu_mul_quant( 2025-05-07T20:33:03.4851145Z self, 2025-05-07T20:33:03.4851343Z T: int, 2025-05-07T20:33:03.4851543Z D: int, 2025-05-07T20:33:03.4851756Z scale_ub: Optional[float], 2025-05-07T20:33:03.4852029Z contiguous: bool, 2025-05-07T20:33:03.4852271Z compiled: bool, 2025-05-07T20:33:03.4852488Z ) -> None: 2025-05-07T20:33:03.4852710Z torch.manual_seed(2025) 2025-05-07T20:33:03.4852954Z 2025-05-07T20:33:03.4853220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4853568Z 2025-05-07T20:33:03.4853774Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4854058Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4854376Z x = x_sign * x_clamp 2025-05-07T20:33:03.4854625Z x0 = x[:, :D] 2025-05-07T20:33:03.4854840Z x1 = x[:, D:] 2025-05-07T20:33:03.4855056Z 2025-05-07T20:33:03.4855244Z if contiguous: 2025-05-07T20:33:03.4855481Z x0 = x0.contiguous() 2025-05-07T20:33:03.4855738Z x1 = x1.contiguous() 2025-05-07T20:33:03.4855983Z 2025-05-07T20:33:03.4856180Z if scale_ub is not None: 2025-05-07T20:33:03.4856448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4856787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4857104Z ) 2025-05-07T20:33:03.4857292Z else: 2025-05-07T20:33:03.4857505Z scale_ub_tensor = None 2025-05-07T20:33:03.4857756Z 2025-05-07T20:33:03.4857984Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4858304Z op = silu_mul_quant 2025-05-07T20:33:03.4858559Z if compiled: 2025-05-07T20:33:03.4858807Z op = torch.compile(op) 2025-05-07T20:33:03.4859104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4859378Z 2025-05-07T20:33:03.4859573Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4859741Z 2025-05-07T20:33:03.4859844Z moe/activation_test.py:117: 2025-05-07T20:33:03.4860143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4860479Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4860760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4861451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:03.4862137Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4862665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4863347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4864021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4864553Z kernel = self.compile( 2025-05-07T20:33:03.4865086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4866107Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4866510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4866740Z 2025-05-07T20:33:03.4866952Z self = 2025-05-07T20:33:03.4868029Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4869625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd260>} 2025-05-07T20:33:03.4871027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4872117Z context = 2025-05-07T20:33:03.4872404Z 2025-05-07T20:33:03.4872570Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4873094Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4873568Z module_map=module_map) 2025-05-07T20:33:03.4873940Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4874293Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4874559Z E ^ 2025-05-07T20:33:03.4875030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4875484Z 2025-05-07T20:33:03.4875960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4876484Z 2025-05-07T20:33:03.4876590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4877007Z self=, 2025-05-07T20:33:03.4877412Z T=1, 2025-05-07T20:33:03.4877593Z D=7168, 2025-05-07T20:33:03.4877789Z scale_ub=None, 2025-05-07T20:33:03.4878007Z contiguous=True, 2025-05-07T20:33:03.4878224Z compiled=True, 2025-05-07T20:33:03.4878430Z ) 2025-05-07T20:33:03.4878755Z self = 2025-05-07T20:33:03.4879232Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.4879495Z 2025-05-07T20:33:03.4879575Z @given( 2025-05-07T20:33:03.4879815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4880131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4880434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4880771Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4881103Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4881383Z ) 2025-05-07T20:33:03.4881733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4882178Z def test_silu_mul_quant( 2025-05-07T20:33:03.4882417Z self, 2025-05-07T20:33:03.4882614Z T: int, 2025-05-07T20:33:03.4882819Z D: int, 2025-05-07T20:33:03.4883033Z scale_ub: Optional[float], 2025-05-07T20:33:03.4883305Z contiguous: bool, 2025-05-07T20:33:03.4883550Z compiled: bool, 2025-05-07T20:33:03.4883768Z ) -> None: 2025-05-07T20:33:03.4883988Z torch.manual_seed(2025) 2025-05-07T20:33:03.4884237Z 2025-05-07T20:33:03.4884517Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4884855Z 2025-05-07T20:33:03.4885053Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4885355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4885662Z x = x_sign * x_clamp 2025-05-07T20:33:03.4885911Z x0 = x[:, :D] 2025-05-07T20:33:03.4886132Z x1 = x[:, D:] 2025-05-07T20:33:03.4886335Z 2025-05-07T20:33:03.4886525Z if contiguous: 2025-05-07T20:33:03.4886759Z x0 = x0.contiguous() 2025-05-07T20:33:03.4887017Z x1 = x1.contiguous() 2025-05-07T20:33:03.4887264Z 2025-05-07T20:33:03.4887465Z if scale_ub is not None: 2025-05-07T20:33:03.4887742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4888086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4888401Z ) 2025-05-07T20:33:03.4888590Z else: 2025-05-07T20:33:03.4888938Z scale_ub_tensor = None 2025-05-07T20:33:03.4889198Z 2025-05-07T20:33:03.4889434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4889750Z op = silu_mul_quant 2025-05-07T20:33:03.4890046Z if compiled: 2025-05-07T20:33:03.4890294Z op = torch.compile(op) 2025-05-07T20:33:03.4890585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4890864Z 2025-05-07T20:33:03.4891064Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4891344Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4891637Z 2025-05-07T20:33:03.4891878Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4892214Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4892512Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4892829Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4893188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4893506Z 2025-05-07T20:33:03.4893713Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.4893909Z 2025-05-07T20:33:03.4894016Z moe/activation_test.py:126: 2025-05-07T20:33:03.4894311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4894650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4894979Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4895757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4896509Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4897054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4897732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4898426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4899153Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4899889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4900535Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4901133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4901652Z fn() 2025-05-07T20:33:03.4902160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4902736Z self.fn.run( 2025-05-07T20:33:03.4903201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4903732Z kernel = self.compile( 2025-05-07T20:33:03.4904275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4904926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4905335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4905565Z 2025-05-07T20:33:03.4905779Z self = 2025-05-07T20:33:03.4906861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4908234Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b67ecaac0>} 2025-05-07T20:33:03.4909675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4910743Z context = 2025-05-07T20:33:03.4911070Z 2025-05-07T20:33:03.4911244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4911764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4912237Z module_map=module_map) 2025-05-07T20:33:03.4912606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4912963Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4913234Z E ^ 2025-05-07T20:33:03.4913701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4914151Z 2025-05-07T20:33:03.4914575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4915086Z 2025-05-07T20:33:03.4915193Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4915611Z self=, 2025-05-07T20:33:03.4916077Z T=4096, 2025-05-07T20:33:03.4916267Z D=5120, 2025-05-07T20:33:03.4916465Z scale_ub=None, 2025-05-07T20:33:03.4916689Z contiguous=False, 2025-05-07T20:33:03.4916924Z compiled=False, 2025-05-07T20:33:03.4917127Z ) 2025-05-07T20:33:03.4917453Z self = 2025-05-07T20:33:03.4917953Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.4918230Z 2025-05-07T20:33:03.4918311Z @given( 2025-05-07T20:33:03.4918550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4918871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4919186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4919520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4919858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4920144Z ) 2025-05-07T20:33:03.4920498Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4920945Z def test_silu_mul_quant( 2025-05-07T20:33:03.4921195Z self, 2025-05-07T20:33:03.4921388Z T: int, 2025-05-07T20:33:03.4921592Z D: int, 2025-05-07T20:33:03.4921814Z scale_ub: Optional[float], 2025-05-07T20:33:03.4922079Z contiguous: bool, 2025-05-07T20:33:03.4922329Z compiled: bool, 2025-05-07T20:33:03.4922555Z ) -> None: 2025-05-07T20:33:03.4922769Z torch.manual_seed(2025) 2025-05-07T20:33:03.4923013Z 2025-05-07T20:33:03.4923285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4923623Z 2025-05-07T20:33:03.4923826Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4924117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4924422Z x = x_sign * x_clamp 2025-05-07T20:33:03.4924673Z x0 = x[:, :D] 2025-05-07T20:33:03.4924894Z x1 = x[:, D:] 2025-05-07T20:33:03.4925098Z 2025-05-07T20:33:03.4925287Z if contiguous: 2025-05-07T20:33:03.4925523Z x0 = x0.contiguous() 2025-05-07T20:33:03.4925783Z x1 = x1.contiguous() 2025-05-07T20:33:03.4926018Z 2025-05-07T20:33:03.4926216Z if scale_ub is not None: 2025-05-07T20:33:03.4926491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4926822Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4927133Z ) 2025-05-07T20:33:03.4927327Z else: 2025-05-07T20:33:03.4935473Z scale_ub_tensor = None 2025-05-07T20:33:03.4935776Z 2025-05-07T20:33:03.4936018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4936540Z op = silu_mul_quant 2025-05-07T20:33:03.4936809Z if compiled: 2025-05-07T20:33:03.4937070Z op = torch.compile(op) 2025-05-07T20:33:03.4937412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4937698Z 2025-05-07T20:33:03.4937904Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4938071Z 2025-05-07T20:33:03.4938173Z moe/activation_test.py:117: 2025-05-07T20:33:03.4938476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4938817Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4939100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4939800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.4940497Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4941038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4941734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4942396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4942943Z kernel = self.compile( 2025-05-07T20:33:03.4943327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4943512Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4943645Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4943650Z 2025-05-07T20:33:03.4943864Z self = 2025-05-07T20:33:03.4944651Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4945160Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cdee0>} 2025-05-07T20:33:03.4945918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4946112Z context = 2025-05-07T20:33:03.4946116Z 2025-05-07T20:33:03.4946293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4946560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4946673Z module_map=module_map) 2025-05-07T20:33:03.4946853Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4946958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4947047Z E ^ 2025-05-07T20:33:03.4947406Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4947413Z 2025-05-07T20:33:03.4947829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4947834Z 2025-05-07T20:33:03.4947948Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4948175Z self=, 2025-05-07T20:33:03.4948265Z T=4096, 2025-05-07T20:33:03.4948346Z D=7168, 2025-05-07T20:33:03.4948431Z scale_ub=None, 2025-05-07T20:33:03.4948530Z contiguous=False, 2025-05-07T20:33:03.4948618Z compiled=False, 2025-05-07T20:33:03.4948697Z ) 2025-05-07T20:33:03.4948928Z self = 2025-05-07T20:33:03.4949235Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.4949241Z 2025-05-07T20:33:03.4949323Z @given( 2025-05-07T20:33:03.4949456Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4949600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4949730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4949851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4949970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4950054Z ) 2025-05-07T20:33:03.4950299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4950394Z def test_silu_mul_quant( 2025-05-07T20:33:03.4950482Z self, 2025-05-07T20:33:03.4950561Z T: int, 2025-05-07T20:33:03.4950640Z D: int, 2025-05-07T20:33:03.4950750Z scale_ub: Optional[float], 2025-05-07T20:33:03.4950842Z contiguous: bool, 2025-05-07T20:33:03.4950936Z compiled: bool, 2025-05-07T20:33:03.4951028Z ) -> None: 2025-05-07T20:33:03.4951125Z torch.manual_seed(2025) 2025-05-07T20:33:03.4951209Z 2025-05-07T20:33:03.4951383Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4951458Z 2025-05-07T20:33:03.4951565Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4951693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4951784Z x = x_sign * x_clamp 2025-05-07T20:33:03.4951878Z x0 = x[:, :D] 2025-05-07T20:33:03.4951960Z x1 = x[:, D:] 2025-05-07T20:33:03.4952035Z 2025-05-07T20:33:03.4952128Z if contiguous: 2025-05-07T20:33:03.4952225Z x0 = x0.contiguous() 2025-05-07T20:33:03.4952320Z x1 = x1.contiguous() 2025-05-07T20:33:03.4952403Z 2025-05-07T20:33:03.4952494Z if scale_ub is not None: 2025-05-07T20:33:03.4952603Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4952753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4952835Z ) 2025-05-07T20:33:03.4952920Z else: 2025-05-07T20:33:03.4953018Z scale_ub_tensor = None 2025-05-07T20:33:03.4953096Z 2025-05-07T20:33:03.4953234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4953325Z op = silu_mul_quant 2025-05-07T20:33:03.4953411Z if compiled: 2025-05-07T20:33:03.4953520Z op = torch.compile(op) 2025-05-07T20:33:03.4953627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4953702Z 2025-05-07T20:33:03.4953801Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.4953806Z 2025-05-07T20:33:03.4953906Z moe/activation_test.py:117: 2025-05-07T20:33:03.4954044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4954149Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.4954255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4954764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.4954865Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.4955223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4955454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4955883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4955986Z kernel = self.compile( 2025-05-07T20:33:03.4956366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4956545Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4956770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4956813Z 2025-05-07T20:33:03.4957017Z self = 2025-05-07T20:33:03.4957804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4958348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b908cd940>} 2025-05-07T20:33:03.4959093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4959291Z context = 2025-05-07T20:33:03.4959295Z 2025-05-07T20:33:03.4959468Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4959740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4959853Z module_map=module_map) 2025-05-07T20:33:03.4960015Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4960125Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.4960206Z E ^ 2025-05-07T20:33:03.4960561Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.4960573Z 2025-05-07T20:33:03.4960984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.4960988Z 2025-05-07T20:33:03.4961094Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.4961324Z self=, 2025-05-07T20:33:03.4961410Z T=128, 2025-05-07T20:33:03.4961490Z D=7168, 2025-05-07T20:33:03.4961584Z scale_ub=None, 2025-05-07T20:33:03.4961673Z contiguous=False, 2025-05-07T20:33:03.4961760Z compiled=True, 2025-05-07T20:33:03.4961848Z ) 2025-05-07T20:33:03.4962068Z self = 2025-05-07T20:33:03.4962246Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.4962251Z 2025-05-07T20:33:03.4962331Z @given( 2025-05-07T20:33:03.4962452Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.4962564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.4962680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.4962798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.4962921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.4962997Z ) 2025-05-07T20:33:03.4963258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.4963357Z def test_silu_mul_quant( 2025-05-07T20:33:03.4963440Z self, 2025-05-07T20:33:03.4963527Z T: int, 2025-05-07T20:33:03.4963610Z D: int, 2025-05-07T20:33:03.4963711Z scale_ub: Optional[float], 2025-05-07T20:33:03.4963811Z contiguous: bool, 2025-05-07T20:33:03.4963899Z compiled: bool, 2025-05-07T20:33:03.4963980Z ) -> None: 2025-05-07T20:33:03.4964085Z torch.manual_seed(2025) 2025-05-07T20:33:03.4964160Z 2025-05-07T20:33:03.4964329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.4964414Z 2025-05-07T20:33:03.4964508Z x_sign = torch.sign(x) 2025-05-07T20:33:03.4964635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.4964733Z x = x_sign * x_clamp 2025-05-07T20:33:03.4964816Z x0 = x[:, :D] 2025-05-07T20:33:03.4964908Z x1 = x[:, D:] 2025-05-07T20:33:03.4964988Z 2025-05-07T20:33:03.4965197Z if contiguous: 2025-05-07T20:33:03.4965301Z x0 = x0.contiguous() 2025-05-07T20:33:03.4965655Z x1 = x1.contiguous() 2025-05-07T20:33:03.4965773Z 2025-05-07T20:33:03.4966045Z if scale_ub is not None: 2025-05-07T20:33:03.4966156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.4966295Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.4966382Z ) 2025-05-07T20:33:03.4966460Z else: 2025-05-07T20:33:03.4966556Z scale_ub_tensor = None 2025-05-07T20:33:03.4966641Z 2025-05-07T20:33:03.4966770Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4966868Z op = silu_mul_quant 2025-05-07T20:33:03.4966957Z if compiled: 2025-05-07T20:33:03.4967059Z op = torch.compile(op) 2025-05-07T20:33:03.4967172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.4967251Z 2025-05-07T20:33:03.4967353Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.4967485Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.4967560Z 2025-05-07T20:33:03.4967697Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.4967812Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.4967913Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.4968035Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.4968185Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4968261Z 2025-05-07T20:33:03.4968368Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.4968373Z 2025-05-07T20:33:03.4968473Z moe/activation_test.py:126: 2025-05-07T20:33:03.4968604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4968716Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.4968850Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.4969422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.4969548Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.4969932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.4970159Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.4970523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.4970780Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.4971158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.4971327Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.4971683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.4971761Z fn() 2025-05-07T20:33:03.4972159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.4972253Z self.fn.run( 2025-05-07T20:33:03.4972588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.4972682Z kernel = self.compile( 2025-05-07T20:33:03.4973069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.4973243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.4973383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.4973387Z 2025-05-07T20:33:03.4973594Z self = 2025-05-07T20:33:03.4974599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.4975151Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b90281620>} 2025-05-07T20:33:03.4975894Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.4976093Z context = 2025-05-07T20:33:03.4976098Z 2025-05-07T20:33:03.4976263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.4976542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.4976652Z module_map=module_map) 2025-05-07T20:33:03.4976814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.4976935Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.4977015Z E ^ 2025-05-07T20:33:03.4977371Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
E   triton.compiler.errors.CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:115 in fn -> silu_mul_quant, moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:115 in fn -> silu_mul_quant, moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
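All of these failures share one root cause: Triton's fp8e4nv type is CUDA float8 e4m3, and its conversion instructions are, as far as I know, only emitted for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on linux.g5.4xlarge, whose A10G GPU reports SM 8.6, where Triton only exposes fp8e4b15 and fp8e5 (e5m2). A minimal sketch of a skip guard under that assumption (the helper name and its placement are hypothetical, not taken from moe/activation_test.py):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (float8 e4m3) codegen requires SM >= 8.9.
        # The A10G on a g5.4xlarge reports (8, 6), so it would be skipped.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class Fp8ActivationTests(unittest.TestCase):
        ...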
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
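The reference path that fails here, triton_quantize_fp8_row, only needs to produce a (y_fp8, y_scale) pair satisfying y ~= y_fp8.to(torch.float32) * y_scale[:, None], which is exactly how the test dequantizes. A rough pure-PyTorch sketch of that row-wise contract (the real kernel's scale_ub handling and clamping details may differ):

    import torch

    def rowwise_fp8_quant_sketch(y, scale_ub=None):
        # Scale each row so its max magnitude maps onto the fp8 e4m3 max (448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max used for scaling.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = (row_max / fp8_max).clamp(min=1e-12)
        y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale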
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
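Hypothesis is incidental; any single call should reproduce the error on this runner. A sketch, assuming silu_mul_quant can be imported from the module path shown in the tracebacks:

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)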
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:115 in fn -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant, moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
E   triton.compiler.errors.CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:124 in ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
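Note that both entry points target fp8e4nv: the op under test compiles _fbgemm_silu_mul_quant and the test's own reference compiles _kernel_quantize_fp8_row, so every example fails no matter which side runs first. If a fallback were preferred over skipping, an illustrative dtype pick (not fbgemm_gpu's actual behavior) could trade e4m3's extra mantissa bit for e5m2, which Triton does support on SM 8.6:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Illustration only: use e4m3 (Triton fp8e4nv) on SM >= 8.9,
        # otherwise fall back to e5m2 (Triton fp8e5).
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2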
y_scale_ref = ref_fn() 2025-05-07T20:33:03.5111273Z 2025-05-07T20:33:03.5111372Z moe/activation_test.py:126: 2025-05-07T20:33:03.5111633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5111748Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.5111883Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.5112484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.5112586Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.5112943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5113170Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5113536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.5113802Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.5114177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.5114343Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.5114689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.5114768Z fn() 2025-05-07T20:33:03.5115165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.5115254Z self.fn.run( 2025-05-07T20:33:03.5115590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5115690Z kernel = self.compile( 2025-05-07T20:33:03.5116142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5116323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5116462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5116466Z 2025-05-07T20:33:03.5116669Z self = 2025-05-07T20:33:03.5117455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5117963Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66bbb2e0>} 2025-05-07T20:33:03.5118706Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5118908Z context = 2025-05-07T20:33:03.5118912Z 2025-05-07T20:33:03.5119078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5119353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5119486Z module_map=module_map) 2025-05-07T20:33:03.5119672Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5119783Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.5119862Z E ^ 2025-05-07T20:33:03.5120217Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5120228Z 2025-05-07T20:33:03.5120638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5120643Z 2025-05-07T20:33:03.5120747Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5121101Z self=, 2025-05-07T20:33:03.5121184Z T=1, 2025-05-07T20:33:03.5121262Z D=5120, 2025-05-07T20:33:03.5121351Z scale_ub=None, 2025-05-07T20:33:03.5121478Z contiguous=True, 2025-05-07T20:33:03.5121563Z compiled=False, 2025-05-07T20:33:03.5121646Z ) 2025-05-07T20:33:03.5121867Z self = 2025-05-07T20:33:03.5122036Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5122041Z 2025-05-07T20:33:03.5122119Z @given( 2025-05-07T20:33:03.5122240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5122345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5122466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5122584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5122707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5122791Z ) 2025-05-07T20:33:03.5123043Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5123137Z def test_silu_mul_quant( 2025-05-07T20:33:03.5123218Z self, 2025-05-07T20:33:03.5123303Z T: int, 2025-05-07T20:33:03.5123385Z D: int, 2025-05-07T20:33:03.5123486Z scale_ub: Optional[float], 2025-05-07T20:33:03.5123585Z contiguous: bool, 2025-05-07T20:33:03.5123671Z compiled: bool, 2025-05-07T20:33:03.5123750Z ) -> None: 2025-05-07T20:33:03.5123853Z torch.manual_seed(2025) 2025-05-07T20:33:03.5123928Z 2025-05-07T20:33:03.5124098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5124181Z 2025-05-07T20:33:03.5124277Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5124402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5124499Z x = x_sign * x_clamp 2025-05-07T20:33:03.5124589Z x0 = x[:, :D] 2025-05-07T20:33:03.5124677Z x1 = x[:, D:] 2025-05-07T20:33:03.5124754Z 2025-05-07T20:33:03.5124838Z if contiguous: 2025-05-07T20:33:03.5124938Z x0 = x0.contiguous() 2025-05-07T20:33:03.5125032Z x1 = x1.contiguous() 2025-05-07T20:33:03.5125107Z 2025-05-07T20:33:03.5125206Z if scale_ub is not None: 2025-05-07T20:33:03.5125313Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5125448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5125537Z ) 2025-05-07T20:33:03.5125614Z else: 2025-05-07T20:33:03.5125711Z scale_ub_tensor = None 2025-05-07T20:33:03.5125797Z 2025-05-07T20:33:03.5125928Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5126025Z op = silu_mul_quant 2025-05-07T20:33:03.5126113Z if compiled: 2025-05-07T20:33:03.5126215Z op = torch.compile(op) 2025-05-07T20:33:03.5126339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5126414Z 2025-05-07T20:33:03.5126506Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5126510Z 2025-05-07T20:33:03.5126615Z moe/activation_test.py:117: 2025-05-07T20:33:03.5126751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5126854Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5126964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5127462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5127565Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5127921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5128145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5128575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5128733Z kernel = self.compile( 2025-05-07T20:33:03.5129115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5129339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5129480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5129486Z 2025-05-07T20:33:03.5129729Z self = 2025-05-07T20:33:03.5130508Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5131028Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b66bbbc40>} 2025-05-07T20:33:03.5131772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5131965Z context = 2025-05-07T20:33:03.5131970Z 2025-05-07T20:33:03.5132141Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5132406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5132520Z module_map=module_map) 2025-05-07T20:33:03.5132682Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5132781Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5132866Z E ^ 2025-05-07T20:33:03.5133230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5133237Z 2025-05-07T20:33:03.5133649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5133663Z 2025-05-07T20:33:03.5133767Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5133991Z self=, 2025-05-07T20:33:03.5134078Z T=128, 2025-05-07T20:33:03.5134157Z D=5120, 2025-05-07T20:33:03.5134244Z scale_ub=None, 2025-05-07T20:33:03.5134338Z contiguous=False, 2025-05-07T20:33:03.5134423Z compiled=True, 2025-05-07T20:33:03.5134498Z ) 2025-05-07T20:33:03.5134721Z self = 2025-05-07T20:33:03.5134892Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5134896Z 2025-05-07T20:33:03.5134976Z @given( 2025-05-07T20:33:03.5135108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5135211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5135334Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5135453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5135568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5135650Z ) 2025-05-07T20:33:03.5135895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5135989Z def test_silu_mul_quant( 2025-05-07T20:33:03.5136073Z self, 2025-05-07T20:33:03.5136151Z T: int, 2025-05-07T20:33:03.5136228Z D: int, 2025-05-07T20:33:03.5136337Z scale_ub: Optional[float], 2025-05-07T20:33:03.5136428Z contiguous: bool, 2025-05-07T20:33:03.5136521Z compiled: bool, 2025-05-07T20:33:03.5136600Z ) -> None: 2025-05-07T20:33:03.5136698Z torch.manual_seed(2025) 2025-05-07T20:33:03.5136782Z 2025-05-07T20:33:03.5137077Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5137153Z 2025-05-07T20:33:03.5137254Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5137380Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5137512Z x = x_sign * x_clamp 2025-05-07T20:33:03.5137602Z x0 = x[:, :D] 2025-05-07T20:33:03.5137684Z x1 = x[:, D:] 2025-05-07T20:33:03.5137759Z 2025-05-07T20:33:03.5137850Z if contiguous: 2025-05-07T20:33:03.5137943Z x0 = x0.contiguous() 2025-05-07T20:33:03.5138034Z x1 = x1.contiguous() 2025-05-07T20:33:03.5138115Z 2025-05-07T20:33:03.5138207Z if scale_ub is not None: 2025-05-07T20:33:03.5138319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5138454Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5138532Z ) 2025-05-07T20:33:03.5138617Z else: 2025-05-07T20:33:03.5138720Z scale_ub_tensor = None 2025-05-07T20:33:03.5138799Z 2025-05-07T20:33:03.5138936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5139029Z op = silu_mul_quant 2025-05-07T20:33:03.5139120Z if compiled: 2025-05-07T20:33:03.5139232Z op = torch.compile(op) 2025-05-07T20:33:03.5139339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5139418Z 2025-05-07T20:33:03.5139545Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5139550Z 2025-05-07T20:33:03.5139661Z moe/activation_test.py:117: 2025-05-07T20:33:03.5139809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5139911Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5140018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5140399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5140495Z return fn(*args, **kwargs) 
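Note on the failure mode repeated in these traces: Triton's fp8e4nv is the NVIDIA variant of the OCP float8 e4m3 type (torch.float8_e4m3fn), and Triton's NVIDIA backend accepts it only on GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on linux.g5.4xlarge.nvidia.gpu, i.e. an A10G at compute capability 8.6, so every kernel that touches fp8e4nv is rejected when the kernel is first compiled; only fp8e4b15 and fp8e5 are accepted on this architecture. A minimal detection sketch (not part of the test file; the function name is illustrative):

import torch

def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs sm_89+ (Ada/Hopper); the A10G here is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)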
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError from _fbgemm_silu_mul_quant (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): same CompilationError from _fbgemm_silu_mul_quant (fp8e4nv not supported in this architecture)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5186198Z 2025-05-07T20:33:03.5186609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5186614Z 2025-05-07T20:33:03.5186723Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5186946Z self=, 2025-05-07T20:33:03.5187153Z T=1, 2025-05-07T20:33:03.5187239Z D=7168, 2025-05-07T20:33:03.5187323Z scale_ub=1200.0, 2025-05-07T20:33:03.5187409Z contiguous=True, 2025-05-07T20:33:03.5187498Z compiled=True, 2025-05-07T20:33:03.5187615Z ) 2025-05-07T20:33:03.5187840Z self = 2025-05-07T20:33:03.5188005Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:03.5188009Z 2025-05-07T20:33:03.5188086Z @given( 2025-05-07T20:33:03.5188211Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5188310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5188427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5188550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5188664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5188740Z ) 2025-05-07T20:33:03.5188999Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5189095Z def test_silu_mul_quant( 2025-05-07T20:33:03.5189178Z self, 2025-05-07T20:33:03.5189257Z T: int, 2025-05-07T20:33:03.5189338Z D: int, 2025-05-07T20:33:03.5189446Z scale_ub: Optional[float], 2025-05-07T20:33:03.5189545Z contiguous: bool, 2025-05-07T20:33:03.5189651Z compiled: bool, 2025-05-07T20:33:03.5189748Z ) -> None: 2025-05-07T20:33:03.5189861Z torch.manual_seed(2025) 2025-05-07T20:33:03.5189935Z 2025-05-07T20:33:03.5190110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5190188Z 2025-05-07T20:33:03.5190280Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5190422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5190512Z x = x_sign * x_clamp 2025-05-07T20:33:03.5190594Z x0 = x[:, :D] 2025-05-07T20:33:03.5196600Z x1 = x[:, D:] 2025-05-07T20:33:03.5196696Z 2025-05-07T20:33:03.5196808Z if contiguous: 2025-05-07T20:33:03.5196903Z x0 = x0.contiguous() 2025-05-07T20:33:03.5196995Z x1 = x1.contiguous() 2025-05-07T20:33:03.5197078Z 2025-05-07T20:33:03.5197174Z if scale_ub is not None: 2025-05-07T20:33:03.5197284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5197432Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5197510Z ) 2025-05-07T20:33:03.5197596Z else: 2025-05-07T20:33:03.5197692Z scale_ub_tensor = None 2025-05-07T20:33:03.5197767Z 2025-05-07T20:33:03.5197907Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5197998Z op = silu_mul_quant 2025-05-07T20:33:03.5198086Z if compiled: 2025-05-07T20:33:03.5198193Z op = torch.compile(op) 2025-05-07T20:33:03.5198298Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5198372Z 2025-05-07T20:33:03.5198483Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5198490Z 2025-05-07T20:33:03.5198589Z moe/activation_test.py:117: 2025-05-07T20:33:03.5198720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5198831Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5198932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5199315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5199411Z return fn(*args, **kwargs) 
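The compiled=True examples, like the one above, enter through torch/_dynamo/eval_frame.py before reaching the same launch site: silu_mul_quant invokes the Triton JIT kernel _fbgemm_silu_mul_quant[grid](...), and Triton compiles the kernel at first launch (jit.py run -> self.compile -> make_ir), so torch.compile does not change the outcome. Any kernel that materializes the type reproduces the error on this GPU; a hypothetical repro sketch (this kernel is not FBGEMM's):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(X + offs, mask=mask)
    # on sm < 8.9 this cast is what trips ValueError("type fp8e4nv not supported ...")
    tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)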
2025-05-07T20:33:03.5199952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5200059Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5200414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5200634Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5201181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5201287Z kernel = self.compile( 2025-05-07T20:33:03.5201710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5201886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5202023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5202028Z 2025-05-07T20:33:03.5202235Z self = 2025-05-07T20:33:03.5203020Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5203530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef4360>} 2025-05-07T20:33:03.5204282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5204476Z context = 2025-05-07T20:33:03.5204480Z 2025-05-07T20:33:03.5204644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5204916Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5205027Z module_map=module_map) 2025-05-07T20:33:03.5205188Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5205297Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5205379Z E ^ 2025-05-07T20:33:03.5205749Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5205754Z 2025-05-07T20:33:03.5206167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5206175Z 2025-05-07T20:33:03.5206279Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5206512Z self=, 2025-05-07T20:33:03.5206592Z T=1, 2025-05-07T20:33:03.5206671Z D=7168, 2025-05-07T20:33:03.5206763Z scale_ub=1200.0, 2025-05-07T20:33:03.5206851Z contiguous=False, 2025-05-07T20:33:03.5206942Z compiled=True, 2025-05-07T20:33:03.5207019Z ) 2025-05-07T20:33:03.5207237Z self = 2025-05-07T20:33:03.5207408Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.5207413Z 2025-05-07T20:33:03.5207498Z @given( 2025-05-07T20:33:03.5207620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5207729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5207845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5207966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5208087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5208165Z ) 2025-05-07T20:33:03.5208416Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5208512Z def test_silu_mul_quant( 2025-05-07T20:33:03.5208591Z self, 2025-05-07T20:33:03.5208677Z T: int, 2025-05-07T20:33:03.5208756Z D: int, 2025-05-07T20:33:03.5208858Z scale_ub: Optional[float], 2025-05-07T20:33:03.5208957Z contiguous: bool, 2025-05-07T20:33:03.5209045Z compiled: bool, 2025-05-07T20:33:03.5209125Z ) -> None: 2025-05-07T20:33:03.5209314Z torch.manual_seed(2025) 2025-05-07T20:33:03.5209427Z 2025-05-07T20:33:03.5209597Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5209679Z 2025-05-07T20:33:03.5209773Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5209948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5210038Z x = x_sign * x_clamp 2025-05-07T20:33:03.5210120Z x0 = x[:, :D] 2025-05-07T20:33:03.5210209Z x1 = x[:, D:] 2025-05-07T20:33:03.5210283Z 2025-05-07T20:33:03.5210369Z if contiguous: 2025-05-07T20:33:03.5210470Z x0 = x0.contiguous() 2025-05-07T20:33:03.5210561Z x1 = x1.contiguous() 2025-05-07T20:33:03.5210635Z 2025-05-07T20:33:03.5210735Z if scale_ub is not None: 2025-05-07T20:33:03.5210842Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5210977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5211064Z ) 2025-05-07T20:33:03.5211149Z else: 2025-05-07T20:33:03.5211253Z scale_ub_tensor = None 2025-05-07T20:33:03.5211329Z 2025-05-07T20:33:03.5211459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5211561Z op = silu_mul_quant 2025-05-07T20:33:03.5211648Z if compiled: 2025-05-07T20:33:03.5211750Z op = torch.compile(op) 2025-05-07T20:33:03.5211862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5211936Z 2025-05-07T20:33:03.5212029Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5212034Z 2025-05-07T20:33:03.5212139Z moe/activation_test.py:117: 2025-05-07T20:33:03.5212269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5212377Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5212477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5212846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5212954Z return fn(*args, **kwargs) 
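Not every example dies inside silu_mul_quant itself: the next example below (T=1, D=7168, scale_ub=None) gets past fn() and instead fails in ref_fn, because triton_quantize_fp8_row is backed by another fp8e4nv Triton kernel, _kernel_quantize_fp8_row, and that one is autotuned: the error surfaces from autotuner.py (_bench / do_bench), which compiles each candidate config before timing it. For reference, row-wise fp8 quantization of the kind being tested can be sketched in eager PyTorch as follows; the scale convention matches the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None], but the constants and clamping details are assumptions, not FBGEMM's exact kernel:

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

def quantize_fp8_row_sketch(y, scale_ub=None):
    row_max = y.abs().amax(dim=-1)                  # per-row max magnitude
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # optional upper bound on the scale
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale   # dequantize as y_fp8.float() * y_scale[:, None]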
2025-05-07T20:33:03.5213446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5213547Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5213908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5214129Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5214473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5214566Z kernel = self.compile( 2025-05-07T20:33:03.5214945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5215125Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5215256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5215263Z 2025-05-07T20:33:03.5215468Z self = 2025-05-07T20:33:03.5216253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5216756Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef59e0>} 2025-05-07T20:33:03.5217505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5217698Z context = 2025-05-07T20:33:03.5217828Z 2025-05-07T20:33:03.5218002Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5218266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5218422Z module_map=module_map) 2025-05-07T20:33:03.5218593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5218696Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5218775Z E ^ 2025-05-07T20:33:03.5219139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5219144Z 2025-05-07T20:33:03.5219581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5219585Z 2025-05-07T20:33:03.5219719Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5219948Z self=, 2025-05-07T20:33:03.5220029Z T=1, 2025-05-07T20:33:03.5220117Z D=7168, 2025-05-07T20:33:03.5220204Z scale_ub=None, 2025-05-07T20:33:03.5220294Z contiguous=False, 2025-05-07T20:33:03.5220390Z compiled=True, 2025-05-07T20:33:03.5220467Z ) 2025-05-07T20:33:03.5220697Z self = 2025-05-07T20:33:03.5220861Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5220865Z 2025-05-07T20:33:03.5220947Z @given( 2025-05-07T20:33:03.5221078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5221178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5221295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5221422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5221537Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5221614Z ) 2025-05-07T20:33:03.5221873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5221968Z def test_silu_mul_quant( 2025-05-07T20:33:03.5222054Z self, 2025-05-07T20:33:03.5222135Z T: int, 2025-05-07T20:33:03.5222214Z D: int, 2025-05-07T20:33:03.5222324Z scale_ub: Optional[float], 2025-05-07T20:33:03.5222415Z contiguous: bool, 2025-05-07T20:33:03.5222502Z compiled: bool, 2025-05-07T20:33:03.5222590Z ) -> None: 2025-05-07T20:33:03.5222689Z torch.manual_seed(2025) 2025-05-07T20:33:03.5222765Z 2025-05-07T20:33:03.5222941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5223017Z 2025-05-07T20:33:03.5223113Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5223247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5223337Z x = x_sign * x_clamp 2025-05-07T20:33:03.5223430Z x0 = x[:, :D] 2025-05-07T20:33:03.5223512Z x1 = x[:, D:] 2025-05-07T20:33:03.5223593Z 2025-05-07T20:33:03.5223688Z if contiguous: 2025-05-07T20:33:03.5223782Z x0 = x0.contiguous() 2025-05-07T20:33:03.5223875Z x1 = x1.contiguous() 2025-05-07T20:33:03.5223962Z 2025-05-07T20:33:03.5224054Z if scale_ub is not None: 2025-05-07T20:33:03.5224162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5224306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5224386Z ) 2025-05-07T20:33:03.5224463Z else: 2025-05-07T20:33:03.5224566Z scale_ub_tensor = None 2025-05-07T20:33:03.5224641Z 2025-05-07T20:33:03.5224778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5224870Z op = silu_mul_quant 2025-05-07T20:33:03.5224956Z if compiled: 2025-05-07T20:33:03.5225064Z op = torch.compile(op) 2025-05-07T20:33:03.5225172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5225936Z 2025-05-07T20:33:03.5226042Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.5226164Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.5226240Z 2025-05-07T20:33:03.5226450Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5226553Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.5226654Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.5226783Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.5226923Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.5227007Z 2025-05-07T20:33:03.5227110Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.5227115Z 2025-05-07T20:33:03.5227213Z moe/activation_test.py:126: 2025-05-07T20:33:03.5227354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5227459Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.5227604Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.5228171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.5228277Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.5228641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5228862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5229226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.5229499Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.5229921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.5230098Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.5230440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.5230520Z fn() 2025-05-07T20:33:03.5230928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.5231015Z self.fn.run( 2025-05-07T20:33:03.5231350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5231450Z kernel = self.compile( 2025-05-07T20:33:03.5231830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5232013Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5232142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5232147Z 2025-05-07T20:33:03.5232357Z self = 2025-05-07T20:33:03.5233142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5233648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ef6700>} 2025-05-07T20:33:03.5234397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5234587Z context = 2025-05-07T20:33:03.5234592Z 2025-05-07T20:33:03.5234755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5235115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5235268Z module_map=module_map) 2025-05-07T20:33:03.5235439Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5235583Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.5235664Z E ^ 2025-05-07T20:33:03.5236093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5236098Z 2025-05-07T20:33:03.5236513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5236518Z 2025-05-07T20:33:03.5236633Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5236857Z self=, 2025-05-07T20:33:03.5236938Z T=1, 2025-05-07T20:33:03.5237027Z D=5120, 2025-05-07T20:33:03.5237112Z scale_ub=1200.0, 2025-05-07T20:33:03.5237209Z contiguous=False, 2025-05-07T20:33:03.5237301Z compiled=True, 2025-05-07T20:33:03.5237377Z ) 2025-05-07T20:33:03.5237596Z self = 2025-05-07T20:33:03.5237773Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.5237778Z 2025-05-07T20:33:03.5237855Z @given( 2025-05-07T20:33:03.5237984Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5238085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5238203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5238331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5238446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5238524Z ) 2025-05-07T20:33:03.5238775Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5238869Z def test_silu_mul_quant( 2025-05-07T20:33:03.5238954Z self, 2025-05-07T20:33:03.5239043Z T: int, 2025-05-07T20:33:03.5239123Z D: int, 2025-05-07T20:33:03.5239222Z scale_ub: Optional[float], 2025-05-07T20:33:03.5239322Z contiguous: bool, 2025-05-07T20:33:03.5239424Z compiled: bool, 2025-05-07T20:33:03.5239524Z ) -> None: 2025-05-07T20:33:03.5239637Z torch.manual_seed(2025) 2025-05-07T20:33:03.5239718Z 2025-05-07T20:33:03.5239899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5239978Z 2025-05-07T20:33:03.5240070Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5240208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5240302Z x = x_sign * x_clamp 2025-05-07T20:33:03.5240387Z x0 = x[:, :D] 2025-05-07T20:33:03.5240473Z x1 = x[:, D:] 2025-05-07T20:33:03.5240554Z 2025-05-07T20:33:03.5240640Z if contiguous: 2025-05-07T20:33:03.5240734Z x0 = x0.contiguous() 2025-05-07T20:33:03.5240839Z x1 = x1.contiguous() 2025-05-07T20:33:03.5240912Z 2025-05-07T20:33:03.5241003Z if scale_ub is not None: 2025-05-07T20:33:03.5241118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5241253Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5241337Z ) 2025-05-07T20:33:03.5241415Z else: 2025-05-07T20:33:03.5241510Z scale_ub_tensor = None 2025-05-07T20:33:03.5241592Z 2025-05-07T20:33:03.5241726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5241819Z op = silu_mul_quant 2025-05-07T20:33:03.5241914Z if compiled: 2025-05-07T20:33:03.5242014Z op = torch.compile(op) 2025-05-07T20:33:03.5242120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5242199Z 2025-05-07T20:33:03.5242290Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5242294Z 2025-05-07T20:33:03.5242398Z moe/activation_test.py:117: 2025-05-07T20:33:03.5242660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5242762Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5242872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5243280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5243374Z return fn(*args, **kwargs) 
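One reading note on the reference implementation that keeps appearing in these listings: ref_fn computes x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32, which is exactly SiLU(x0) * x1 in full precision before the row-wise fp8 quantization step. A quick CPU-only equivalence check (illustrative only):

import torch
import torch.nn.functional as F

x0 = torch.randn(4, 8)
x1 = torch.randn(4, 8)
assert torch.allclose(x0 * torch.sigmoid(x0) * x1, F.silu(x0) * x1)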
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8a69ef7e20>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f8b66666480>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
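[Editor's note on the failure above: Triton lowers the fp8e4nv (FP8 E4M3) dtype only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures such as SM 8.6 (A10G) the compiler rejects it with exactly the ValueError quoted here, offering only fp8e4b15 and fp8e5. A minimal sketch of one possible guard follows; supports_fp8e4nv and requires_fp8 are hypothetical helper names, not part of FBGEMM or of activation_test.py.]

# Hedged sketch (assumption, not FBGEMM code): skip fp8e4nv tests on pre-SM-8.9 GPUs.
import pytest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) lowering requires compute capability >= 8.9 (Ada, Hopper);
    # SM 8.6 and older GPUs raise the ValueError recorded in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker; applied to test_silu_mul_quant it would skip the
# Hypothesis examples before they ever reach Triton compilation.
requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv requires NVIDIA compute capability >= 8.9",
)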
[Hypothesis went on to try ten more parameter combinations; each failed at the same point, with the identical CompilationError raised from triton/compiler/compiler.py:100. The repeated test source and tracebacks are elided; only the tried parameters differ:]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
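[Editor's note: to isolate the problem outside the test suite, a minimal Triton kernel that merely touches an fp8e4nv value reproduces the same compile-time failure on unsupported GPUs. The kernel below is an illustrative stand-in under that assumption, not the real _fbgemm_silu_mul_quant.]

# Hedged repro sketch: casting/storing tl.float8e4nv fails at compile time on SM < 8.9.
import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # On pre-SM-8.9 GPUs, compilation aborts with:
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv))

x = torch.randn(128, device="cuda", dtype=torch.float32)
y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv[(1,)](x, y, BLOCK=128)  # raises triton.compiler.errors.CompilationError here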
2025-05-07T20:33:03.5402546Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:03.5415332Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5415897Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:03.5428125Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5428656Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:03.5441604Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
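One readable difference between the repeated tracebacks: examples with compiled=True enter through torch/_dynamo/eval_frame.py (the extra "in _fn" frame) before reaching the Triton launch, while compiled=False calls the op directly; the terminal CompilationError is identical either way. A small sketch of that wrapping, with a stand-in body rather than the real kernel:

import torch

def op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for silu_mul_quant; only the call path is of interest.
    return x * torch.sigmoid(x)

compiled_op = torch.compile(op)
# Calls to compiled_op route through Dynamo's eval_frame._fn wrapper,
# which is why compiled=True tracebacks carry one extra frame before
# triton/runtime/jit.py; the underlying kernel compile is unchanged.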
2025-05-07T20:33:03.5442141Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:03.5461176Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5461717Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:03.5475171Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5475709Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:03.5488786Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
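Hypothesis keeps drawing fresh examples because every draw fails before any shape-dependent work runs: the Triton kernel is compiled up front, so T, D, scale_ub, contiguous, and compiled never get a chance to matter. The strategies in the test above sample from a fixed grid; a quick sketch of that space (the names here are local to the sketch):

from itertools import product

Ts = [1, 128, 2048, 4096, 16384]
Ds = [5120, 7168]
scale_ubs = [None, 1200.00]
flags = [True, False]

# st.sampled_from draws from fixed pools, so the full search space is
# their Cartesian product: 5 * 2 * 2 * 2 * 2 = 80 combinations.
combos = list(product(Ts, Ds, scale_ubs, flags, flags))
assert len(combos) == 80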
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5488791Z 2025-05-07T20:33:03.5489201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5489206Z 2025-05-07T20:33:03.5489320Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5489544Z self=, 2025-05-07T20:33:03.5489623Z T=16384, 2025-05-07T20:33:03.5489709Z D=5120, 2025-05-07T20:33:03.5489801Z scale_ub=None, 2025-05-07T20:33:03.5489890Z contiguous=False, 2025-05-07T20:33:03.5489982Z compiled=True, 2025-05-07T20:33:03.5490057Z ) 2025-05-07T20:33:03.5490283Z self = 2025-05-07T20:33:03.5490463Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5490468Z 2025-05-07T20:33:03.5490546Z @given( 2025-05-07T20:33:03.5490672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5490774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5490890Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5491022Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5491141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5491219Z ) 2025-05-07T20:33:03.5491472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5491576Z def test_silu_mul_quant( 2025-05-07T20:33:03.5491664Z self, 2025-05-07T20:33:03.5491745Z T: int, 2025-05-07T20:33:03.5491825Z D: int, 2025-05-07T20:33:03.5491935Z scale_ub: Optional[float], 2025-05-07T20:33:03.5492032Z contiguous: bool, 2025-05-07T20:33:03.5492120Z compiled: bool, 2025-05-07T20:33:03.5492209Z ) -> None: 2025-05-07T20:33:03.5492305Z torch.manual_seed(2025) 2025-05-07T20:33:03.5492384Z 2025-05-07T20:33:03.5492560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5492637Z 2025-05-07T20:33:03.5492731Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5492865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5492956Z x = x_sign * x_clamp 2025-05-07T20:33:03.5493047Z x0 = x[:, :D] 2025-05-07T20:33:03.5493133Z x1 = x[:, D:] 2025-05-07T20:33:03.5493209Z 2025-05-07T20:33:03.5493304Z if contiguous: 2025-05-07T20:33:03.5493404Z x0 = x0.contiguous() 2025-05-07T20:33:03.5493496Z x1 = x1.contiguous() 2025-05-07T20:33:03.5493582Z 2025-05-07T20:33:03.5493674Z if scale_ub is not None: 2025-05-07T20:33:03.5493787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5493932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5494010Z ) 2025-05-07T20:33:03.5494092Z else: 2025-05-07T20:33:03.5494197Z scale_ub_tensor = None 2025-05-07T20:33:03.5494273Z 2025-05-07T20:33:03.5494404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5494503Z op = silu_mul_quant 2025-05-07T20:33:03.5494590Z if compiled: 2025-05-07T20:33:03.5494700Z op = torch.compile(op) 2025-05-07T20:33:03.5494807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5494885Z 2025-05-07T20:33:03.5494992Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5494996Z 2025-05-07T20:33:03.5495226Z moe/activation_test.py:117: 2025-05-07T20:33:03.5495359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5495467Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5495626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5495994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5496098Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5496587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5496685Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5497047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5497268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5497612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5497717Z kernel = self.compile( 2025-05-07T20:33:03.5498097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5498284Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5498411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5498416Z 2025-05-07T20:33:03.5498622Z self = 2025-05-07T20:33:03.5499404Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5499910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69accb80>} 2025-05-07T20:33:03.5500658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5500851Z context = 2025-05-07T20:33:03.5500855Z 2025-05-07T20:33:03.5501017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5501285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5501392Z module_map=module_map) 2025-05-07T20:33:03.5501562Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5501662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5501741Z E ^ 2025-05-07T20:33:03.5502104Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5502111Z 2025-05-07T20:33:03.5502522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5502528Z 2025-05-07T20:33:03.5502639Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5502862Z self=, 2025-05-07T20:33:03.5502941Z T=2048, 2025-05-07T20:33:03.5503024Z D=5120, 2025-05-07T20:33:03.5503108Z scale_ub=None, 2025-05-07T20:33:03.5503195Z contiguous=False, 2025-05-07T20:33:03.5503287Z compiled=True, 2025-05-07T20:33:03.5503362Z ) 2025-05-07T20:33:03.5503579Z self = 2025-05-07T20:33:03.5503759Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.5503763Z 2025-05-07T20:33:03.5503840Z @given( 2025-05-07T20:33:03.5504089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5504191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5504308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5504471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5504586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5504660Z ) 2025-05-07T20:33:03.5504911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5505006Z def test_silu_mul_quant( 2025-05-07T20:33:03.5505084Z self, 2025-05-07T20:33:03.5505170Z T: int, 2025-05-07T20:33:03.5505247Z D: int, 2025-05-07T20:33:03.5505351Z scale_ub: Optional[float], 2025-05-07T20:33:03.5505443Z contiguous: bool, 2025-05-07T20:33:03.5505530Z compiled: bool, 2025-05-07T20:33:03.5505615Z ) -> None: 2025-05-07T20:33:03.5505714Z torch.manual_seed(2025) 2025-05-07T20:33:03.5505787Z 2025-05-07T20:33:03.5505968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5506044Z 2025-05-07T20:33:03.5506136Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5506270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5506361Z x = x_sign * x_clamp 2025-05-07T20:33:03.5506443Z x0 = x[:, :D] 2025-05-07T20:33:03.5506531Z x1 = x[:, D:] 2025-05-07T20:33:03.5506606Z 2025-05-07T20:33:03.5506690Z if contiguous: 2025-05-07T20:33:03.5506789Z x0 = x0.contiguous() 2025-05-07T20:33:03.5506880Z x1 = x1.contiguous() 2025-05-07T20:33:03.5506959Z 2025-05-07T20:33:03.5507050Z if scale_ub is not None: 2025-05-07T20:33:03.5507157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5507297Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5507374Z ) 2025-05-07T20:33:03.5507452Z else: 2025-05-07T20:33:03.5507561Z scale_ub_tensor = None 2025-05-07T20:33:03.5507634Z 2025-05-07T20:33:03.5507766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5507864Z op = silu_mul_quant 2025-05-07T20:33:03.5507953Z if compiled: 2025-05-07T20:33:03.5508053Z op = torch.compile(op) 2025-05-07T20:33:03.5508165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5508239Z 2025-05-07T20:33:03.5508338Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5508342Z 2025-05-07T20:33:03.5508443Z moe/activation_test.py:117: 2025-05-07T20:33:03.5508574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5508682Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5508783Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5509149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5509256Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5509801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5509906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5510264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5510487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5510834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5510929Z kernel = self.compile( 2025-05-07T20:33:03.5511306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5511492Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5511704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5511746Z 2025-05-07T20:33:03.5511957Z self = 2025-05-07T20:33:03.5512732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5513281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69ace0c0>} 2025-05-07T20:33:03.5514023Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5514214Z context = 2025-05-07T20:33:03.5514219Z 2025-05-07T20:33:03.5514396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5514658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5514781Z module_map=module_map) 2025-05-07T20:33:03.5514944Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5515046Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5515132Z E ^ 2025-05-07T20:33:03.5515485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5515490Z 2025-05-07T20:33:03.5516050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5516062Z 2025-05-07T20:33:03.5516169Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5516394Z self=, 2025-05-07T20:33:03.5516489Z T=2048, 2025-05-07T20:33:03.5516568Z D=5120, 2025-05-07T20:33:03.5516654Z scale_ub=1200.0, 2025-05-07T20:33:03.5516751Z contiguous=False, 2025-05-07T20:33:03.5516837Z compiled=True, 2025-05-07T20:33:03.5516916Z ) 2025-05-07T20:33:03.5517140Z self = 2025-05-07T20:33:03.5517316Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:03.5517320Z 2025-05-07T20:33:03.5517401Z @given( 2025-05-07T20:33:03.5517527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5517629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5517749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5517866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5517981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5518069Z ) 2025-05-07T20:33:03.5518317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5518416Z def test_silu_mul_quant( 2025-05-07T20:33:03.5518503Z self, 2025-05-07T20:33:03.5518583Z T: int, 2025-05-07T20:33:03.5518665Z D: int, 2025-05-07T20:33:03.5518772Z scale_ub: Optional[float], 2025-05-07T20:33:03.5518862Z contiguous: bool, 2025-05-07T20:33:03.5518953Z compiled: bool, 2025-05-07T20:33:03.5519033Z ) -> None: 2025-05-07T20:33:03.5519128Z torch.manual_seed(2025) 2025-05-07T20:33:03.5519208Z 2025-05-07T20:33:03.5519381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5519475Z 2025-05-07T20:33:03.5519582Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5519724Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5519815Z x = x_sign * x_clamp 2025-05-07T20:33:03.5519902Z x0 = x[:, :D] 2025-05-07T20:33:03.5519986Z x1 = x[:, D:] 2025-05-07T20:33:03.5520061Z 2025-05-07T20:33:03.5520325Z if contiguous: 2025-05-07T20:33:03.5520419Z x0 = x0.contiguous() 2025-05-07T20:33:03.5520509Z x1 = x1.contiguous() 2025-05-07T20:33:03.5520589Z 2025-05-07T20:33:03.5520721Z if scale_ub is not None: 2025-05-07T20:33:03.5520832Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5520966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5521043Z ) 2025-05-07T20:33:03.5521126Z else: 2025-05-07T20:33:03.5521222Z scale_ub_tensor = None 2025-05-07T20:33:03.5521296Z 2025-05-07T20:33:03.5521434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5521524Z op = silu_mul_quant 2025-05-07T20:33:03.5521610Z if compiled: 2025-05-07T20:33:03.5521719Z op = torch.compile(op) 2025-05-07T20:33:03.5521825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5521899Z 2025-05-07T20:33:03.5522004Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5522008Z 2025-05-07T20:33:03.5522107Z moe/activation_test.py:117: 2025-05-07T20:33:03.5522244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5522348Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5522448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5522819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5522913Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5523402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5523508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5523864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5524097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5524439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5524533Z kernel = self.compile( 2025-05-07T20:33:03.5524923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5525098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5525233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5525237Z 2025-05-07T20:33:03.5525444Z self = 2025-05-07T20:33:03.5526217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5526731Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69acf2e0>} 2025-05-07T20:33:03.5527472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5527672Z context = 2025-05-07T20:33:03.5527676Z 2025-05-07T20:33:03.5527840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5528102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5528218Z module_map=module_map) 2025-05-07T20:33:03.5528382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5528492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5528574Z E ^ 2025-05-07T20:33:03.5529055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5529060Z 2025-05-07T20:33:03.5529478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5529520Z 2025-05-07T20:33:03.5529626Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5529856Z self=, 2025-05-07T20:33:03.5529941Z T=4096, 2025-05-07T20:33:03.5530021Z D=5120, 2025-05-07T20:33:03.5530114Z scale_ub=1200.0, 2025-05-07T20:33:03.5530203Z contiguous=True, 2025-05-07T20:33:03.5530290Z compiled=True, 2025-05-07T20:33:03.5530374Z ) 2025-05-07T20:33:03.5530598Z self = 2025-05-07T20:33:03.5530772Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:03.5530779Z 2025-05-07T20:33:03.5530872Z @given( 2025-05-07T20:33:03.5530992Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5531100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5531221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5531339Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5531458Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5531535Z ) 2025-05-07T20:33:03.5531778Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5531881Z def test_silu_mul_quant( 2025-05-07T20:33:03.5531961Z self, 2025-05-07T20:33:03.5532040Z T: int, 2025-05-07T20:33:03.5532126Z D: int, 2025-05-07T20:33:03.5532227Z scale_ub: Optional[float], 2025-05-07T20:33:03.5532317Z contiguous: bool, 2025-05-07T20:33:03.5532409Z compiled: bool, 2025-05-07T20:33:03.5532489Z ) -> None: 2025-05-07T20:33:03.5532601Z torch.manual_seed(2025) 2025-05-07T20:33:03.5532675Z 2025-05-07T20:33:03.5532843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5532925Z 2025-05-07T20:33:03.5533020Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5533145Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5533240Z x = x_sign * x_clamp 2025-05-07T20:33:03.5533321Z x0 = x[:, :D] 2025-05-07T20:33:03.5533402Z x1 = x[:, D:] 2025-05-07T20:33:03.5533483Z 2025-05-07T20:33:03.5533567Z if contiguous: 2025-05-07T20:33:03.5533660Z x0 = x0.contiguous() 2025-05-07T20:33:03.5533756Z x1 = x1.contiguous() 2025-05-07T20:33:03.5533830Z 2025-05-07T20:33:03.5533923Z if scale_ub is not None: 2025-05-07T20:33:03.5534037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5534170Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5534254Z ) 2025-05-07T20:33:03.5534340Z else: 2025-05-07T20:33:03.5534435Z scale_ub_tensor = None 2025-05-07T20:33:03.5534520Z 2025-05-07T20:33:03.5534651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5534746Z op = silu_mul_quant 2025-05-07T20:33:03.5534841Z if compiled: 2025-05-07T20:33:03.5534944Z op = torch.compile(op) 2025-05-07T20:33:03.5535050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5535131Z 2025-05-07T20:33:03.5535224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5535228Z 2025-05-07T20:33:03.5535334Z moe/activation_test.py:117: 2025-05-07T20:33:03.5535462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5535563Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5535670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5536210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:03.5536345Z return fn(*args, **kwargs) 
2025-05-07T20:33:03.5536840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:03.5536984Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:03.5537349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.5537571Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.5537907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.5538008Z     kernel = self.compile(
2025-05-07T20:33:03.5538387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.5538560Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.5538702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5538911Z self =
2025-05-07T20:33:03.5539693Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:03.5540199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948c860>}
2025-05-07T20:33:03.5540946Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:03.5541141Z context =
2025-05-07T20:33:03.5541314Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:03.5541585Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:03.5541696Z                            module_map=module_map)
2025-05-07T20:33:03.5541858Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.5541968Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:03.5542047Z E   ^
2025-05-07T20:33:03.5542409Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5542825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:03.5542936Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:03.5543172Z     self=,
2025-05-07T20:33:03.5543254Z     T=128,
2025-05-07T20:33:03.5543342Z     D=5120,
2025-05-07T20:33:03.5543432Z     scale_ub=1200.0,
2025-05-07T20:33:03.5543524Z     contiguous=False,
2025-05-07T20:33:03.5543617Z     compiled=True,
2025-05-07T20:33:03.5543694Z )
2025-05-07T20:33:03.5543913Z self =
2025-05-07T20:33:03.5544091Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:33:03.5544178Z     @given(
2025-05-07T20:33:03.5544299Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:03.5544407Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:03.5544524Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:03.5544650Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:03.5544765Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:03.5544895Z     )
2025-05-07T20:33:03.5545221Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:03.5545318Z     def test_silu_mul_quant(
2025-05-07T20:33:03.5545399Z         self,
2025-05-07T20:33:03.5545525Z         T: int,
2025-05-07T20:33:03.5545607Z         D: int,
2025-05-07T20:33:03.5545710Z         scale_ub: Optional[float],
2025-05-07T20:33:03.5545809Z         contiguous: bool,
2025-05-07T20:33:03.5545897Z         compiled: bool,
2025-05-07T20:33:03.5545977Z     ) -> None:
2025-05-07T20:33:03.5546077Z         torch.manual_seed(2025)
2025-05-07T20:33:03.5546327Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:03.5546494Z         x_sign = torch.sign(x)
2025-05-07T20:33:03.5546623Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:03.5546713Z         x = x_sign * x_clamp
2025-05-07T20:33:03.5546794Z         x0 = x[:, :D]
2025-05-07T20:33:03.5546889Z         x1 = x[:, D:]
2025-05-07T20:33:03.5547048Z         if contiguous:
2025-05-07T20:33:03.5547148Z             x0 = x0.contiguous()
2025-05-07T20:33:03.5547237Z             x1 = x1.contiguous()
2025-05-07T20:33:03.5547411Z         if scale_ub is not None:
2025-05-07T20:33:03.5547517Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:03.5547652Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:03.5547736Z             )
2025-05-07T20:33:03.5547815Z         else:
2025-05-07T20:33:03.5547916Z             scale_ub_tensor = None
2025-05-07T20:33:03.5548120Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:03.5548220Z             op = silu_mul_quant
2025-05-07T20:33:03.5548307Z             if compiled:
2025-05-07T20:33:03.5548408Z                 op = torch.compile(op)
2025-05-07T20:33:03.5548519Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:03.5548695Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:03.5548806Z moe/activation_test.py:117:
2025-05-07T20:33:03.5548936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5549043Z moe/activation_test.py:115: in fn
2025-05-07T20:33:03.5549144Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:03.5549537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:03.5549656Z     return fn(*args, **kwargs)
2025-05-07T20:33:03.5550149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:03.5550246Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:03.5550610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:03.5550838Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:03.5551180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:03.5551278Z     kernel = self.compile(
2025-05-07T20:33:03.5551659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:03.5551838Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:03.5551965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:03.5552178Z self =
2025-05-07T20:33:03.5553075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:03.5553616Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a6948d580>}
2025-05-07T20:33:03.5554406Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:03.5554596Z context =
2025-05-07T20:33:03.5554773Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:03.5555035Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:03.5555144Z                            module_map=module_map)
2025-05-07T20:33:03.5555314Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:03.5555421Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:03.5555503Z E   ^
2025-05-07T20:33:03.5556020Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:03.5556442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
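For context, the op under test fuses a SiLU-gated multiply with fp8 quantization. A minimal reference sketch of the computation, inferred from the test's inputs and outputs (silu_mul_quant's actual kernel math in fbgemm_gpu may differ; the helper name and the scale_ub handling are assumptions):

# Hedged reference sketch -- not FBGEMM's kernel. Assumes y = silu(x0) * x1,
# quantized per-tensor to float8_e4m3fn (Triton's fp8e4nv), with the amax
# used for the scale optionally clamped by scale_ub.
from typing import Optional, Tuple

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.float().squeeze())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = (amax / fp8_max).clamp(min=1e-12)  # guard against all-zero input
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale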
2025-05-07T20:33:03.5556557Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- same CompilationError: fp8e4nv not supported
2025-05-07T20:33:03.5570356Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -- same CompilationError
2025-05-07T20:33:03.5601128Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
2025-05-07T20:33:03.5632766Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:33:03.5665298Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
2025-05-07T20:33:03.5698430Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -- same CompilationError
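Every example above fails at the same point: Triton cannot lower the fp8e4nv (float8_e4m3fn) type on this GPU, and reports only fp8e4b15 and fp8e5 as supported, which indicates a compute capability below sm_89. A sketch of a capability gate that would skip these cases instead of failing them (the helper and threshold are illustrative, not the guard FBGEMM actually uses):

# Hypothetical skip guard, not the repo's actual code: fp8e4nv generally
# requires an NVIDIA GPU with compute capability >= 8.9 (Ada/Hopper).
import unittest

import torch

def device_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not device_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
class ActivationFP8Tests(unittest.TestCase):
    ...  # fp8 property tests would live here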
2025-05-07T20:33:03.5730959Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:03.5731378Z     self=,
2025-05-07T20:33:03.5731787Z     T=16384,
2025-05-07T20:33:03.5731984Z     D=5120,
2025-05-07T20:33:03.5732185Z     scale_ub=None,
2025-05-07T20:33:03.5732410Z     contiguous=False,
2025-05-07T20:33:03.5732637Z     compiled=False,
2025-05-07T20:33:03.5732851Z )
2025-05-07T20:33:03.5733689Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:03.5746582Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:03.5747147Z         x_sign = torch.sign(x)
2025-05-07T20:33:03.5747437Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:03.5749488Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:03.5751498Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:03.5751833Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB, 28.44 MiB free
2025-05-07T20:33:03.5765968Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB, 140.44 MiB free
2025-05-07T20:33:03.5778812Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB, 28.44 MiB free
2025-05-07T20:33:03.5792125Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB, 28.44 MiB free
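The OOM failures are a secondary effect: allocations accumulate across failing examples until the 22.07 GiB device is nearly full, at which point even 56 MiB requests fail. The error message itself suggests one mitigation; a small sketch of that plus an explicit cache flush between examples (the env var must be set before the first CUDA allocation to take effect; the helper is illustrative, not part of this test suite):

# Sketch of the allocator hint from the error message, plus a manual flush.
# PYTORCH_CUDA_ALLOC_CONF is only honored if set before CUDA initializes.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch

def free_cached_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.empty_cache()  # return cached allocator blocks to the driver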
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.5801293Z 2025-05-07T20:33:03.5801413Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:03.5801424Z 2025-05-07T20:33:03.5801527Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5801752Z self=, 2025-05-07T20:33:03.5801837Z T=1, 2025-05-07T20:33:03.5801916Z D=7168, 2025-05-07T20:33:03.5802001Z scale_ub=1200.0, 2025-05-07T20:33:03.5802093Z contiguous=True, 2025-05-07T20:33:03.5802179Z compiled=False, 2025-05-07T20:33:03.5802255Z ) 2025-05-07T20:33:03.5802482Z self = 2025-05-07T20:33:03.5802653Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.5802658Z 2025-05-07T20:33:03.5802740Z @given( 2025-05-07T20:33:03.5802866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5802966Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5803086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5803204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5803320Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5803402Z ) 2025-05-07T20:33:03.5803645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5803739Z def test_silu_mul_quant( 2025-05-07T20:33:03.5803826Z self, 2025-05-07T20:33:03.5803905Z T: int, 2025-05-07T20:33:03.5803983Z D: int, 2025-05-07T20:33:03.5804099Z scale_ub: Optional[float], 2025-05-07T20:33:03.5804192Z contiguous: bool, 2025-05-07T20:33:03.5804288Z compiled: bool, 2025-05-07T20:33:03.5804369Z ) -> None: 2025-05-07T20:33:03.5804467Z torch.manual_seed(2025) 2025-05-07T20:33:03.5804556Z 2025-05-07T20:33:03.5804727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5804808Z 2025-05-07T20:33:03.5804907Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5805031Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5805125Z x = x_sign * x_clamp 2025-05-07T20:33:03.5805217Z x0 = x[:, :D] 2025-05-07T20:33:03.5805300Z x1 = x[:, D:] 2025-05-07T20:33:03.5805376Z 2025-05-07T20:33:03.5805468Z if contiguous: 2025-05-07T20:33:03.5805562Z x0 = x0.contiguous() 2025-05-07T20:33:03.5805654Z x1 = x1.contiguous() 2025-05-07T20:33:03.5805737Z 2025-05-07T20:33:03.5805832Z if scale_ub is not None: 2025-05-07T20:33:03.5806040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5806183Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5806262Z ) 2025-05-07T20:33:03.5806386Z else: 2025-05-07T20:33:03.5806482Z scale_ub_tensor = None 2025-05-07T20:33:03.5806558Z 2025-05-07T20:33:03.5806698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5806791Z op = silu_mul_quant 2025-05-07T20:33:03.5806879Z if compiled: 2025-05-07T20:33:03.5806987Z op = torch.compile(op) 2025-05-07T20:33:03.5807095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5807173Z 2025-05-07T20:33:03.5807273Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5807277Z 2025-05-07T20:33:03.5807378Z moe/activation_test.py:117: 2025-05-07T20:33:03.5807520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5807623Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5807729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5808287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5808390Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5808752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5808983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5809324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5809427Z kernel = self.compile( 2025-05-07T20:33:03.5809812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5809989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5810133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5810143Z 2025-05-07T20:33:03.5810350Z self = 2025-05-07T20:33:03.5811145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5811650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a692dab60>} 2025-05-07T20:33:03.5812400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5812604Z context = 2025-05-07T20:33:03.5812610Z 2025-05-07T20:33:03.5812779Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5813057Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5813169Z module_map=module_map) 2025-05-07T20:33:03.5813332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5813439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5813519Z E ^ 2025-05-07T20:33:03.5813885Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5813889Z 2025-05-07T20:33:03.5814304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5814308Z 2025-05-07T20:33:03.5814416Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5814722Z self=, 2025-05-07T20:33:03.5814842Z T=128, 2025-05-07T20:33:03.5814924Z D=5120, 2025-05-07T20:33:03.5815018Z scale_ub=None, 2025-05-07T20:33:03.5815111Z contiguous=True, 2025-05-07T20:33:03.5815244Z compiled=False, 2025-05-07T20:33:03.5815321Z ) 2025-05-07T20:33:03.5815538Z self = 2025-05-07T20:33:03.5815716Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5815721Z 2025-05-07T20:33:03.5815799Z @given( 2025-05-07T20:33:03.5815919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5816025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5816141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5816258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5816378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5816455Z ) 2025-05-07T20:33:03.5816766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5816863Z def test_silu_mul_quant( 2025-05-07T20:33:03.5816941Z self, 2025-05-07T20:33:03.5817029Z T: int, 2025-05-07T20:33:03.5817108Z D: int, 2025-05-07T20:33:03.5817211Z scale_ub: Optional[float], 2025-05-07T20:33:03.5817309Z contiguous: bool, 2025-05-07T20:33:03.5817396Z compiled: bool, 2025-05-07T20:33:03.5817477Z ) -> None: 2025-05-07T20:33:03.5817583Z torch.manual_seed(2025) 2025-05-07T20:33:03.5817658Z 2025-05-07T20:33:03.5817828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5817911Z 2025-05-07T20:33:03.5818006Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5818137Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5818229Z x = x_sign * x_clamp 2025-05-07T20:33:03.5818312Z x0 = x[:, :D] 2025-05-07T20:33:03.5818407Z x1 = x[:, D:] 2025-05-07T20:33:03.5818485Z 2025-05-07T20:33:03.5818573Z if contiguous: 2025-05-07T20:33:03.5818675Z x0 = x0.contiguous() 2025-05-07T20:33:03.5818766Z x1 = x1.contiguous() 2025-05-07T20:33:03.5818845Z 2025-05-07T20:33:03.5818947Z if scale_ub is not None: 2025-05-07T20:33:03.5819055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5819192Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5819279Z ) 2025-05-07T20:33:03.5819357Z else: 2025-05-07T20:33:03.5819454Z scale_ub_tensor = None 2025-05-07T20:33:03.5819537Z 2025-05-07T20:33:03.5819668Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5819767Z op = silu_mul_quant 2025-05-07T20:33:03.5819858Z if compiled: 2025-05-07T20:33:03.5819959Z op = torch.compile(op) 2025-05-07T20:33:03.5820073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5820155Z 2025-05-07T20:33:03.5820248Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5820255Z 2025-05-07T20:33:03.5820363Z moe/activation_test.py:117: 2025-05-07T20:33:03.5820496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5820602Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5820713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5821212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5821317Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5821676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5821901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5822293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5822428Z kernel = self.compile( 2025-05-07T20:33:03.5822820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5823036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5823168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5823172Z 2025-05-07T20:33:03.5823386Z self = 2025-05-07T20:33:03.5824172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5824686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a692dbc40>} 2025-05-07T20:33:03.5825481Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5825677Z context = 2025-05-07T20:33:03.5825682Z 2025-05-07T20:33:03.5825858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5826125Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5826243Z module_map=module_map) 2025-05-07T20:33:03.5826410Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5826516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5826605Z E ^ 2025-05-07T20:33:03.5826966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5826975Z 2025-05-07T20:33:03.5827390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5827404Z 2025-05-07T20:33:03.5827510Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5827738Z self=, 2025-05-07T20:33:03.5827829Z T=128, 2025-05-07T20:33:03.5827910Z D=7168, 2025-05-07T20:33:03.5827996Z scale_ub=None, 2025-05-07T20:33:03.5828093Z contiguous=True, 2025-05-07T20:33:03.5828184Z compiled=False, 2025-05-07T20:33:03.5828262Z ) 2025-05-07T20:33:03.5828487Z self = 2025-05-07T20:33:03.5828658Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5828662Z 2025-05-07T20:33:03.5828748Z @given( 2025-05-07T20:33:03.5828868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5828975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5829103Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5829226Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5829344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5829428Z ) 2025-05-07T20:33:03.5829696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5829801Z def test_silu_mul_quant( 2025-05-07T20:33:03.5829908Z self, 2025-05-07T20:33:03.5829991Z T: int, 2025-05-07T20:33:03.5830071Z D: int, 2025-05-07T20:33:03.5830180Z scale_ub: Optional[float], 2025-05-07T20:33:03.5830272Z contiguous: bool, 2025-05-07T20:33:03.5830367Z compiled: bool, 2025-05-07T20:33:03.5830450Z ) -> None: 2025-05-07T20:33:03.5830548Z torch.manual_seed(2025) 2025-05-07T20:33:03.5830633Z 2025-05-07T20:33:03.5830853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5830995Z 2025-05-07T20:33:03.5831098Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5831231Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5831365Z x = x_sign * x_clamp 2025-05-07T20:33:03.5831455Z x0 = x[:, :D] 2025-05-07T20:33:03.5831541Z x1 = x[:, D:] 2025-05-07T20:33:03.5831618Z 2025-05-07T20:33:03.5831712Z if contiguous: 2025-05-07T20:33:03.5831807Z x0 = x0.contiguous() 2025-05-07T20:33:03.5831905Z x1 = x1.contiguous() 2025-05-07T20:33:03.5831981Z 2025-05-07T20:33:03.5832075Z if scale_ub is not None: 2025-05-07T20:33:03.5832187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5832328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5832408Z ) 2025-05-07T20:33:03.5832493Z else: 2025-05-07T20:33:03.5832593Z scale_ub_tensor = None 2025-05-07T20:33:03.5832669Z 2025-05-07T20:33:03.5832812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5832956Z op = silu_mul_quant 2025-05-07T20:33:03.5833045Z if compiled: 2025-05-07T20:33:03.5833158Z op = torch.compile(op) 2025-05-07T20:33:03.5833266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5833350Z 2025-05-07T20:33:03.5833444Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5833448Z 2025-05-07T20:33:03.5833548Z moe/activation_test.py:117: 2025-05-07T20:33:03.5833685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5833788Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5833895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5834398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5834496Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5834873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5835097Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5835440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5835543Z kernel = self.compile( 2025-05-07T20:33:03.5836083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5836264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5836402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5836407Z 2025-05-07T20:33:03.5836613Z self = 2025-05-07T20:33:03.5837406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5837913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a69074ae0>} 2025-05-07T20:33:03.5838668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5838861Z context = 2025-05-07T20:33:03.5838865Z 2025-05-07T20:33:03.5839032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5839303Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5839415Z module_map=module_map) 2025-05-07T20:33:03.5839675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5839789Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5839873Z E ^ 2025-05-07T20:33:03.5840275Z E ValueError("type fp8e4nv not supported in this architecture. 
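The failures above come from Triton rejecting the fp8e4nv (e4m3) element type at kernel-compile time: fp8e4nv requires an NVIDIA GPU with compute capability 8.9 or newer, while the A10G in a linux.g5.4xlarge runner is compute capability 8.6. A minimal sketch of a capability guard a test like this could use follows; the helper name and the unittest wiring are illustrative assumptions, not FBGEMM's actual code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Triton only exposes fp8e4nv (e4m3) on NVIDIA GPUs with compute
        # capability >= 8.9 (Ada / Hopper); the A10G is sm_86.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical guard: skip the fp8 tests instead of failing on sm_86.
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationFP8Tests(unittest.TestCase):
        ...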
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5840280Z 2025-05-07T20:33:03.5840693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5840697Z 2025-05-07T20:33:03.5840806Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5841038Z self=, 2025-05-07T20:33:03.5841119Z T=2048, 2025-05-07T20:33:03.5841207Z D=7168, 2025-05-07T20:33:03.5841299Z scale_ub=1200.0, 2025-05-07T20:33:03.5841392Z contiguous=True, 2025-05-07T20:33:03.5841490Z compiled=False, 2025-05-07T20:33:03.5841569Z ) 2025-05-07T20:33:03.5841799Z self = 2025-05-07T20:33:03.5842055Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.5842061Z 2025-05-07T20:33:03.5842144Z @given( 2025-05-07T20:33:03.5842269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5842377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5842496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5842625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5842742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5842821Z ) 2025-05-07T20:33:03.5843078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5843177Z def test_silu_mul_quant( 2025-05-07T20:33:03.5843257Z self, 2025-05-07T20:33:03.5843346Z T: int, 2025-05-07T20:33:03.5843426Z D: int, 2025-05-07T20:33:03.5843531Z scale_ub: Optional[float], 2025-05-07T20:33:03.5843633Z contiguous: bool, 2025-05-07T20:33:03.5843728Z compiled: bool, 2025-05-07T20:33:03.5843810Z ) -> None: 2025-05-07T20:33:03.5843915Z torch.manual_seed(2025) 2025-05-07T20:33:03.5843996Z 2025-05-07T20:33:03.5844170Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5845966Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:03.5845971Z 2025-05-07T20:33:03.5846102Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:03.5846106Z 2025-05-07T20:33:03.5846214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5846438Z self=, 2025-05-07T20:33:03.5846525Z T=1, 2025-05-07T20:33:03.5846608Z D=5120, 2025-05-07T20:33:03.5846694Z scale_ub=1200.0, 2025-05-07T20:33:03.5846787Z contiguous=True, 2025-05-07T20:33:03.5846873Z compiled=False, 2025-05-07T20:33:03.5846949Z ) 2025-05-07T20:33:03.5847175Z self = 2025-05-07T20:33:03.5847340Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:03.5847345Z 2025-05-07T20:33:03.5847432Z @given( 2025-05-07T20:33:03.5847553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5847652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5847823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5847984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5848100Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5848184Z ) 2025-05-07T20:33:03.5848472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5848567Z def test_silu_mul_quant( 2025-05-07T20:33:03.5848650Z self, 2025-05-07T20:33:03.5848729Z T: int, 2025-05-07T20:33:03.5848818Z D: int, 2025-05-07T20:33:03.5848918Z scale_ub: Optional[float], 2025-05-07T20:33:03.5849007Z contiguous: bool, 2025-05-07T20:33:03.5849099Z compiled: bool, 2025-05-07T20:33:03.5849179Z ) -> None: 2025-05-07T20:33:03.5849276Z torch.manual_seed(2025) 2025-05-07T20:33:03.5849357Z 2025-05-07T20:33:03.5849524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5849600Z 2025-05-07T20:33:03.5849706Z x_sign = torch.sign(x) 2025-05-07T20:33:03.5849836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.5849976Z x = x_sign * x_clamp 2025-05-07T20:33:03.5850069Z x0 = x[:, :D] 2025-05-07T20:33:03.5850155Z x1 = x[:, D:] 2025-05-07T20:33:03.5850232Z 2025-05-07T20:33:03.5850327Z if contiguous: 2025-05-07T20:33:03.5850424Z x0 = x0.contiguous() 2025-05-07T20:33:03.5850517Z x1 = x1.contiguous() 2025-05-07T20:33:03.5850602Z 2025-05-07T20:33:03.5850695Z if scale_ub is not None: 2025-05-07T20:33:03.5850809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.5850945Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.5851024Z ) 2025-05-07T20:33:03.5851111Z else: 2025-05-07T20:33:03.5851210Z scale_ub_tensor = None 2025-05-07T20:33:03.5851286Z 2025-05-07T20:33:03.5851422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.5851517Z op = silu_mul_quant 2025-05-07T20:33:03.5851607Z if compiled: 2025-05-07T20:33:03.5851717Z op = torch.compile(op) 2025-05-07T20:33:03.5851825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5851903Z 2025-05-07T20:33:03.5852002Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.5852007Z 2025-05-07T20:33:03.5852106Z moe/activation_test.py:117: 2025-05-07T20:33:03.5852242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5852345Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.5852446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.5852949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.5853048Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.5853412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.5853645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.5853991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.5854095Z kernel = self.compile( 2025-05-07T20:33:03.5854479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.5854656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.5854791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.5854796Z 2025-05-07T20:33:03.5855000Z self = 2025-05-07T20:33:03.5855831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.5856476Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8a690760c0>} 2025-05-07T20:33:03.5857265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.5857466Z context = 2025-05-07T20:33:03.5857470Z 2025-05-07T20:33:03.5857637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.5857912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.5858022Z module_map=module_map) 2025-05-07T20:33:03.5858186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.5858304Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.5858383Z E ^ 2025-05-07T20:33:03.5858820Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.5858828Z 2025-05-07T20:33:03.5859241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.5859246Z 2025-05-07T20:33:03.5859352Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.5859600Z self=, 2025-05-07T20:33:03.5859688Z T=2048, 2025-05-07T20:33:03.5859782Z D=5120, 2025-05-07T20:33:03.5859882Z scale_ub=None, 2025-05-07T20:33:03.5859969Z contiguous=True, 2025-05-07T20:33:03.5860061Z compiled=False, 2025-05-07T20:33:03.5860138Z ) 2025-05-07T20:33:03.5860357Z self = 2025-05-07T20:33:03.5860542Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:03.5860549Z 2025-05-07T20:33:03.5860631Z @given( 2025-05-07T20:33:03.5860751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.5860860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.5860978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.5861098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.5861219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.5861296Z ) 2025-05-07T20:33:03.5861549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.5861645Z def test_silu_mul_quant( 2025-05-07T20:33:03.5861728Z self, 2025-05-07T20:33:03.5861814Z T: int, 2025-05-07T20:33:03.5861894Z D: int, 2025-05-07T20:33:03.5861994Z scale_ub: Optional[float], 2025-05-07T20:33:03.5862096Z contiguous: bool, 2025-05-07T20:33:03.5862187Z compiled: bool, 2025-05-07T20:33:03.5862271Z ) -> None: 2025-05-07T20:33:03.5862378Z torch.manual_seed(2025) 2025-05-07T20:33:03.5862455Z 2025-05-07T20:33:03.5862628Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.5862717Z 2025-05-07T20:33:03.5862812Z > x_sign = torch.sign(x) 2025-05-07T20:33:03.5864611Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
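Note that the "free" figure in these OutOfMemoryError messages shrinks as the run proceeds (26.44 MiB free here, 4.44 MiB by the last examples), which suggests tensors from earlier failed examples are still alive between Hypothesis examples. A sketch of a per-test cleanup that would release cached blocks between examples; the tearDown placement is an assumption about the test class, not FBGEMM's actual code:

    import gc
    import unittest

    import torch


    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop dangling Python references from the failed example first,
            # then return cached CUDA blocks to the driver so the next
            # Hypothesis example starts from a clean allocator state.
            gc.collect()
            torch.cuda.empty_cache()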
The next ten examples all hit the same OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)); only the parameters and the requested allocation size differ (26.44 MiB free on GPU 0 throughout):

    T      D     scale_ub  contiguous  compiled  requested
    16384  5120  None      True        False     320.00 MiB
    4096   5120  None      True        False      80.00 MiB
    2048   5120  None      False       False      40.00 MiB
    4096   7168  None      True        True      112.00 MiB
    2048   5120  1200.0    False       False      40.00 MiB
    4096   7168  1200.0    True        False     112.00 MiB
    16384  7168  None      False       True      448.00 MiB
    4096   7168  None      True        False     112.00 MiB
    16384  7168  None      True        False     448.00 MiB
    16384  7168  1200.0    True        False     448.00 MiB
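The requested sizes in this table are exactly the footprint of the first allocation in the test, x = torch.randn([T, 2 * D], ...) in bfloat16 at 2 bytes per element. A quick check of that arithmetic:

    def randn_mib(T: int, D: int) -> float:
        # x holds T * (2 * D) bfloat16 elements at 2 bytes each.
        return T * 2 * D * 2 / 2**20

    assert randn_mib(16384, 5120) == 320.0  # matches "Tried to allocate 320.00 MiB"
    assert randn_mib(4096, 5120) == 80.0
    assert randn_mib(2048, 5120) == 40.0
    assert randn_mib(4096, 7168) == 112.0
    assert randn_mib(16384, 7168) == 448.0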
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
moe/activation_test.py:117: CompilationError (same fp8e4nv failure as above)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (26.44 MiB free; 21.74 GiB allocated by PyTorch)

moe/activation_test.py:92: OutOfMemoryError
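The allocator hint repeated in these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if it is set before the process makes its first CUDA allocation, for example in the job's environment block. A sketch, assuming a plain Python entry point:

    import os

    # Must be set before the CUDA caching allocator is initialized,
    # i.e. before the first CUDA allocation anywhere in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402

    x = torch.randn(1024, device="cuda")  # allocated from expandable segments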
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
(the compiled path adds the torch._dynamo frame but otherwise fails identically)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 20.00 MiB; 4.44 MiB free)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 20.00 MiB; 4.44 MiB free)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:33:03.5997734Z 2025-05-07T20:33:03.5998048Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:03.5998265Z ================= 1 failed, 1 deselected, 3 warnings in 15.28s ================= 2025-05-07T20:33:05.2145062Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:05.2762711Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:33:05.2762974Z 2025-05-07T20:33:07.2778566Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:09.4357853Z ============================= test session starts ============================== 2025-05-07T20:33:09.4358684Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:09.4359269Z cachedir: .pytest_cache 2025-05-07T20:33:09.4359973Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:09.4360954Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:09.4361516Z plugins: hypothesis-6.131.14 2025-05-07T20:33:11.0545777Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:11.1633719Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:11.1634299Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:11.1634601Z 2025-05-07T20:33:13.5127578Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.5128605Z self=, 2025-05-07T20:33:13.5129112Z T=1, 2025-05-07T20:33:13.5129470Z D=5120, 2025-05-07T20:33:13.5129768Z scale_ub=None, 2025-05-07T20:33:13.5130074Z contiguous=True, 2025-05-07T20:33:13.5130444Z compiled=True, 2025-05-07T20:33:13.5130744Z ) 2025-05-07T20:33:13.5131147Z self = 2025-05-07T20:33:13.5132139Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:13.5132496Z 2025-05-07T20:33:13.5132617Z @given( 2025-05-07T20:33:13.5132941Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.5133473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.5133896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.5134304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.5134755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.5135155Z ) 2025-05-07T20:33:13.5135582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.5136204Z def test_silu_mul_quant( 2025-05-07T20:33:13.5136505Z self, 2025-05-07T20:33:13.5136778Z T: int, 2025-05-07T20:33:13.5137148Z D: int, 2025-05-07T20:33:13.5137426Z scale_ub: Optional[float], 2025-05-07T20:33:13.5137777Z contiguous: bool, 2025-05-07T20:33:13.5138202Z compiled: bool, 2025-05-07T20:33:13.5138488Z ) -> None: 2025-05-07T20:33:13.5138870Z torch.manual_seed(2025) 2025-05-07T20:33:13.5139291Z 2025-05-07T20:33:13.5139625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.5140073Z 2025-05-07T20:33:13.5140436Z x_sign = torch.sign(x) 2025-05-07T20:33:13.5140816Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:13.5141193Z x = x_sign * x_clamp 2025-05-07T20:33:13.5141595Z x0 = x[:, :D] 2025-05-07T20:33:13.5141902Z x1 = x[:, D:] 2025-05-07T20:33:13.5142176Z 2025-05-07T20:33:13.5142514Z if contiguous: 2025-05-07T20:33:13.5142861Z x0 = x0.contiguous() 2025-05-07T20:33:13.5143185Z x1 = x1.contiguous() 2025-05-07T20:33:13.5150121Z 2025-05-07T20:33:13.5150340Z if scale_ub is not None: 2025-05-07T20:33:13.5150621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.5150975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.5151297Z ) 2025-05-07T20:33:13.5151494Z else: 2025-05-07T20:33:13.5151714Z scale_ub_tensor = None 2025-05-07T20:33:13.5151976Z 2025-05-07T20:33:13.5152220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.5152536Z op = silu_mul_quant 2025-05-07T20:33:13.5152794Z if compiled: 2025-05-07T20:33:13.5153053Z op = torch.compile(op) 2025-05-07T20:33:13.5153348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.5153629Z 2025-05-07T20:33:13.5153831Z y_fp8, y_scale = fn() 2025-05-07T20:33:13.5154114Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:13.5154412Z 2025-05-07T20:33:13.5154656Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.5154990Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:13.5155289Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:13.5155616Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:13.5156089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:13.5156406Z 2025-05-07T20:33:13.5156615Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:13.5156810Z 2025-05-07T20:33:13.5156928Z moe/activation_test.py:126: 2025-05-07T20:33:13.5157222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.5157569Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:13.5157900Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:13.5158690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:13.5159443Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:13.5159992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.5161576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.5162267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:13.5163043Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:13.5163781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:13.5164420Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:13.5165015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:13.5165843Z fn() 2025-05-07T20:33:13.5166355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:13.5166933Z self.fn.run( 2025-05-07T20:33:13.5167412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.5168049Z kernel = self.compile( 2025-05-07T20:33:13.5168593Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.5169242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.5169647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.5169878Z 2025-05-07T20:33:13.5170091Z self = 2025-05-07T20:33:13.5171172Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.5172556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f3351c60>} 2025-05-07T20:33:13.5173901Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.5174932Z context = 2025-05-07T20:33:13.5175222Z 2025-05-07T20:33:13.5175395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.5175915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.5176394Z module_map=module_map) 2025-05-07T20:33:13.5176762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.5177128Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:13.5177397Z E ^ 2025-05-07T20:33:13.5177873Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.5178330Z 2025-05-07T20:33:13.5178751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.5179265Z 2025-05-07T20:33:13.5179371Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.5179787Z self=, 2025-05-07T20:33:13.5180195Z T=2048, 2025-05-07T20:33:13.5180391Z D=5120, 2025-05-07T20:33:13.5180581Z scale_ub=1200.0, 2025-05-07T20:33:13.5180807Z contiguous=True, 2025-05-07T20:33:13.5181034Z compiled=False, 2025-05-07T20:33:13.5181237Z ) 2025-05-07T20:33:14.2520866Z self = 2025-05-07T20:33:14.2521653Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:14.2522002Z 2025-05-07T20:33:14.2522094Z @given( 2025-05-07T20:33:14.2522728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.2523061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.2523380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.2523796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.2524132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.2524430Z ) 2025-05-07T20:33:14.2524776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.2525228Z def test_silu_mul_quant( 2025-05-07T20:33:14.2525480Z self, 2025-05-07T20:33:14.2525679Z T: int, 2025-05-07T20:33:14.2525886Z D: int, 2025-05-07T20:33:14.2526119Z scale_ub: Optional[float], 2025-05-07T20:33:14.2526389Z contiguous: bool, 2025-05-07T20:33:14.2526635Z compiled: bool, 2025-05-07T20:33:14.2526868Z ) -> None: 2025-05-07T20:33:14.2527091Z torch.manual_seed(2025) 2025-05-07T20:33:14.2527349Z 2025-05-07T20:33:14.2527726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.2528081Z 2025-05-07T20:33:14.2528277Z x_sign = torch.sign(x) 2025-05-07T20:33:14.2528587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.2528911Z x = x_sign * x_clamp 2025-05-07T20:33:14.2529152Z x0 = x[:, :D] 
2025-05-07T20:33:14.2529379Z x1 = x[:, D:] 2025-05-07T20:33:14.2529594Z 2025-05-07T20:33:14.2529780Z if contiguous: 2025-05-07T20:33:14.2530024Z x0 = x0.contiguous() 2025-05-07T20:33:14.2530293Z x1 = x1.contiguous() 2025-05-07T20:33:14.2530535Z 2025-05-07T20:33:14.2530735Z if scale_ub is not None: 2025-05-07T20:33:14.2531018Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.2531356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.2531674Z ) 2025-05-07T20:33:14.2531872Z else: 2025-05-07T20:33:14.2532083Z scale_ub_tensor = None 2025-05-07T20:33:14.2532338Z 2025-05-07T20:33:14.2532579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2532907Z op = silu_mul_quant 2025-05-07T20:33:14.2533155Z if compiled: 2025-05-07T20:33:14.2533405Z op = torch.compile(op) 2025-05-07T20:33:14.2533702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2533974Z 2025-05-07T20:33:14.2534174Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.2534337Z 2025-05-07T20:33:14.2534452Z moe/activation_test.py:117: 2025-05-07T20:33:14.2534748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2535091Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.2535374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2536062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:14.2536763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.2537302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.2537982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.2538634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.2539166Z kernel = self.compile( 2025-05-07T20:33:14.2539705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.2540359Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.2540760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2540999Z 2025-05-07T20:33:14.2541202Z self = 2025-05-07T20:33:14.2542339Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.2543837Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f31a8220>} 2025-05-07T20:33:14.2545180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.2546204Z context = 2025-05-07T20:33:14.2546488Z 2025-05-07T20:33:14.2546660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.2547185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.2547650Z module_map=module_map) 2025-05-07T20:33:14.2548058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.2548419Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.2548676Z E ^ 2025-05-07T20:33:14.2549141Z E ValueError("type fp8e4nv not supported in this architecture. 
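Annotation: this CompilationError, whose supported-dtype list continues on the next log line, is the underlying failure for every example in this rerun. The linux.g5.4xlarge runner carries an A10G (compute capability sm_86), and Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on that architecture, leaving only fp8e5 (E5M2) and fp8e4b15. A hedged guard sketch; the (8, 9) cutoff is inferred from this error rather than confirmed against Triton's source, and requires_fp8e4nv is a hypothetical marker name, not something defined in activation_test.py:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumed cutoff: fp8e4nv lowering appears to need sm_89 (Ada) or
        # newer; the A10G here reports (8, 6) and only gets fp8e5/fp8e4b15.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical skip marker for the fp8 rowwise-quantization tests.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv (E4M3) is unavailable below sm_89",
    )
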
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.2549592Z 2025-05-07T20:33:14.2550010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.2550516Z 2025-05-07T20:33:14.2550626Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.2551032Z self=, 2025-05-07T20:33:14.2551438Z T=2048, 2025-05-07T20:33:14.2551632Z D=5120, 2025-05-07T20:33:14.2551822Z scale_ub=1200.0, 2025-05-07T20:33:14.2552047Z contiguous=True, 2025-05-07T20:33:14.2552280Z compiled=True, 2025-05-07T20:33:14.2552486Z ) 2025-05-07T20:33:14.2552809Z self = 2025-05-07T20:33:14.2553303Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:14.2553575Z 2025-05-07T20:33:14.2553662Z @given( 2025-05-07T20:33:14.2553889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.2554203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.2554516Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.2554843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.2555174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.2555462Z ) 2025-05-07T20:33:14.2555913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.2556356Z def test_silu_mul_quant( 2025-05-07T20:33:14.2556606Z self, 2025-05-07T20:33:14.2556799Z T: int, 2025-05-07T20:33:14.2557008Z D: int, 2025-05-07T20:33:14.2557235Z scale_ub: Optional[float], 2025-05-07T20:33:14.2557503Z contiguous: bool, 2025-05-07T20:33:14.2557746Z compiled: bool, 2025-05-07T20:33:14.2557975Z ) -> None: 2025-05-07T20:33:14.2558194Z torch.manual_seed(2025) 2025-05-07T20:33:14.2558437Z 2025-05-07T20:33:14.2558714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.2559059Z 2025-05-07T20:33:14.2559249Z x_sign = torch.sign(x) 2025-05-07T20:33:14.2559537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.2559851Z x = x_sign * x_clamp 2025-05-07T20:33:14.2560087Z x0 = x[:, :D] 2025-05-07T20:33:14.2560307Z x1 = x[:, D:] 2025-05-07T20:33:14.2560518Z 2025-05-07T20:33:14.2560702Z if contiguous: 2025-05-07T20:33:14.2560936Z x0 = x0.contiguous() 2025-05-07T20:33:14.2561202Z x1 = x1.contiguous() 2025-05-07T20:33:14.2561530Z 2025-05-07T20:33:14.2561728Z if scale_ub is not None: 2025-05-07T20:33:14.2562006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.2562338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.2562693Z ) 2025-05-07T20:33:14.2562885Z else: 2025-05-07T20:33:14.2563110Z scale_ub_tensor = None 2025-05-07T20:33:14.2563364Z 2025-05-07T20:33:14.2563598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2563913Z op = silu_mul_quant 2025-05-07T20:33:14.2564169Z if compiled: 2025-05-07T20:33:14.2564422Z op = torch.compile(op) 2025-05-07T20:33:14.2564723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.2564999Z 2025-05-07T20:33:14.2565197Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.2565789Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.2566082Z 2025-05-07T20:33:14.2566329Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.2566751Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.2567043Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.2567362Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.2567724Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2568038Z 2025-05-07T20:33:14.2568250Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:14.2568446Z 2025-05-07T20:33:14.2568555Z moe/activation_test.py:126: 2025-05-07T20:33:14.2568857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2569188Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.2569514Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.2570297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.2571046Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.2571597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.2572280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.2572966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.2573679Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.2574410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.2575054Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.2575658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.2576171Z fn() 2025-05-07T20:33:14.2576683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.2577264Z self.fn.run( 2025-05-07T20:33:14.2577724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.2578257Z kernel = self.compile( 2025-05-07T20:33:14.2578795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.2579447Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.2579843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.2580077Z 2025-05-07T20:33:14.2580284Z self = 2025-05-07T20:33:14.2581441Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.2582870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f31a96c0>} 2025-05-07T20:33:14.2584258Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.2585281Z context = 2025-05-07T20:33:14.2585576Z 2025-05-07T20:33:14.2585745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.2586272Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.2586740Z module_map=module_map) 2025-05-07T20:33:14.2587113Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.2587486Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.2587802Z E ^ 2025-05-07T20:33:14.2588271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.2588734Z 2025-05-07T20:33:14.2589145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.2589656Z 2025-05-07T20:33:14.2589766Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.2590177Z self=, 2025-05-07T20:33:14.2590581Z T=16384, 2025-05-07T20:33:14.2590779Z D=7168, 2025-05-07T20:33:14.2590976Z scale_ub=1200.0, 2025-05-07T20:33:14.2591210Z contiguous=False, 2025-05-07T20:33:14.2591439Z compiled=False, 2025-05-07T20:33:14.2591641Z ) 2025-05-07T20:33:14.9933414Z self = 2025-05-07T20:33:14.9934372Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:14.9934859Z 2025-05-07T20:33:14.9934982Z @given( 2025-05-07T20:33:14.9935341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9935845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9936335Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9936856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9937371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9937827Z ) 2025-05-07T20:33:14.9938390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9939113Z def test_silu_mul_quant( 2025-05-07T20:33:14.9939497Z self, 2025-05-07T20:33:14.9939803Z T: int, 2025-05-07T20:33:14.9940104Z D: int, 2025-05-07T20:33:14.9940444Z scale_ub: Optional[float], 2025-05-07T20:33:14.9940881Z contiguous: bool, 2025-05-07T20:33:14.9941271Z compiled: bool, 2025-05-07T20:33:14.9941631Z ) -> None: 2025-05-07T20:33:14.9941985Z torch.manual_seed(2025) 2025-05-07T20:33:14.9942393Z 2025-05-07T20:33:14.9942832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9943394Z 2025-05-07T20:33:14.9943696Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9944153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9944673Z x = x_sign * x_clamp 2025-05-07T20:33:14.9945076Z x0 = x[:, :D] 2025-05-07T20:33:14.9945416Z x1 = x[:, D:] 2025-05-07T20:33:14.9945754Z 2025-05-07T20:33:14.9946065Z if contiguous: 2025-05-07T20:33:14.9946430Z x0 = x0.contiguous() 2025-05-07T20:33:14.9946844Z x1 = x1.contiguous() 2025-05-07T20:33:14.9947231Z 2025-05-07T20:33:14.9947526Z if scale_ub is not None: 2025-05-07T20:33:14.9948111Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9949164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9949670Z ) 2025-05-07T20:33:14.9949968Z else: 2025-05-07T20:33:14.9950295Z scale_ub_tensor = None 2025-05-07T20:33:14.9950814Z 2025-05-07T20:33:14.9951193Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9951700Z op = silu_mul_quant 2025-05-07T20:33:14.9952110Z if compiled: 2025-05-07T20:33:14.9952508Z op = torch.compile(op) 2025-05-07T20:33:14.9952954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9953332Z 2025-05-07T20:33:14.9953596Z > y_fp8, y_scale = fn() 2025-05-07T20:33:14.9953841Z 2025-05-07T20:33:14.9953993Z moe/activation_test.py:117: 2025-05-07T20:33:14.9954570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9955202Z moe/activation_test.py:115: in fn 2025-05-07T20:33:14.9955668Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.9957084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:14.9958279Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:14.9959209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.9960301Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.9961397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.9962231Z kernel = self.compile( 2025-05-07T20:33:14.9963139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.9964243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.9964940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.9969588Z 2025-05-07T20:33:14.9969833Z self = 2025-05-07T20:33:14.9970974Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.9972378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058040>} 2025-05-07T20:33:14.9973729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.9974762Z context = 2025-05-07T20:33:14.9975052Z 2025-05-07T20:33:14.9975228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.9975765Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.9976253Z module_map=module_map) 2025-05-07T20:33:14.9976620Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.9976991Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:14.9977263Z E ^ 2025-05-07T20:33:14.9977736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.9978190Z 2025-05-07T20:33:14.9978607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.9979125Z 2025-05-07T20:33:14.9979232Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.9979651Z self=, 2025-05-07T20:33:14.9980236Z T=1, 2025-05-07T20:33:14.9980426Z D=7168, 2025-05-07T20:33:14.9980632Z scale_ub=None, 2025-05-07T20:33:14.9980856Z contiguous=True, 2025-05-07T20:33:14.9981081Z compiled=True, 2025-05-07T20:33:14.9981366Z ) 2025-05-07T20:33:14.9981696Z self = 2025-05-07T20:33:14.9982180Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:14.9982445Z 2025-05-07T20:33:14.9982525Z @given( 2025-05-07T20:33:14.9982764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.9983087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.9983407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.9983754Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.9984093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.9984387Z ) 2025-05-07T20:33:14.9984748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.9985203Z def test_silu_mul_quant( 2025-05-07T20:33:14.9985507Z self, 2025-05-07T20:33:14.9985713Z T: int, 2025-05-07T20:33:14.9985918Z D: int, 2025-05-07T20:33:14.9986140Z scale_ub: Optional[float], 2025-05-07T20:33:14.9986424Z contiguous: bool, 2025-05-07T20:33:14.9986670Z compiled: bool, 2025-05-07T20:33:14.9986894Z ) -> None: 2025-05-07T20:33:14.9987119Z torch.manual_seed(2025) 2025-05-07T20:33:14.9987369Z 2025-05-07T20:33:14.9987642Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.9987997Z 2025-05-07T20:33:14.9988198Z x_sign = torch.sign(x) 2025-05-07T20:33:14.9988490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.9988814Z x = x_sign * x_clamp 2025-05-07T20:33:14.9989061Z x0 = x[:, :D] 2025-05-07T20:33:14.9989285Z x1 = x[:, D:] 2025-05-07T20:33:14.9989496Z 2025-05-07T20:33:14.9989697Z if contiguous: 2025-05-07T20:33:14.9989937Z x0 = x0.contiguous() 2025-05-07T20:33:14.9990197Z x1 = x1.contiguous() 2025-05-07T20:33:14.9990444Z 2025-05-07T20:33:14.9990644Z if scale_ub is not None: 2025-05-07T20:33:14.9990919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.9991267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.9991588Z ) 2025-05-07T20:33:14.9991789Z else: 2025-05-07T20:33:14.9992009Z scale_ub_tensor = None 2025-05-07T20:33:14.9992271Z 2025-05-07T20:33:14.9992508Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.9992832Z op = silu_mul_quant 2025-05-07T20:33:14.9993094Z if compiled: 2025-05-07T20:33:14.9999842Z op = torch.compile(op) 2025-05-07T20:33:15.0000198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.0000500Z 2025-05-07T20:33:15.0000711Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.0001020Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.0001334Z 2025-05-07T20:33:15.0001596Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.0001946Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.0002259Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.0002592Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.0002960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.0003286Z 2025-05-07T20:33:15.0003503Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.0003706Z 2025-05-07T20:33:15.0003820Z moe/activation_test.py:126: 2025-05-07T20:33:15.0004140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.0004493Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.0004836Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.0005782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.0006579Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.0007271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.0008083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.0008916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.0009789Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.0010674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.0011434Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.0012206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.0012741Z fn() 2025-05-07T20:33:15.0013266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.0013858Z self.fn.run( 2025-05-07T20:33:15.0014342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.0014889Z kernel = self.compile( 2025-05-07T20:33:15.0015435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.0016106Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.0016523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.0016762Z 2025-05-07T20:33:15.0016982Z self = 2025-05-07T20:33:15.0018081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.0019471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058ea0>} 2025-05-07T20:33:15.0020830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.0021866Z context = 2025-05-07T20:33:15.0022159Z 2025-05-07T20:33:15.0022341Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.0022872Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.0023362Z module_map=module_map) 2025-05-07T20:33:15.0023743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.0024113Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.0024395Z E ^ 2025-05-07T20:33:15.0024872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.0025332Z 2025-05-07T20:33:15.0025783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.0026325Z 2025-05-07T20:33:15.0026434Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.0026862Z self=, 2025-05-07T20:33:15.0027278Z T=4096, 2025-05-07T20:33:15.0027476Z D=5120, 2025-05-07T20:33:15.0027683Z scale_ub=None, 2025-05-07T20:33:15.0027915Z contiguous=False, 2025-05-07T20:33:15.0028236Z compiled=False, 2025-05-07T20:33:15.0028459Z ) 2025-05-07T20:33:15.7989301Z self = 2025-05-07T20:33:15.7989985Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:15.7990270Z 2025-05-07T20:33:15.7990372Z @given( 2025-05-07T20:33:15.7990614Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.7990946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.7991272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.7991613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.7991959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.7992262Z ) 2025-05-07T20:33:15.7992616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.7993065Z def test_silu_mul_quant( 2025-05-07T20:33:15.7993318Z self, 2025-05-07T20:33:15.7993534Z T: int, 2025-05-07T20:33:15.7993737Z D: int, 2025-05-07T20:33:15.7994068Z scale_ub: Optional[float], 2025-05-07T20:33:15.7994360Z contiguous: bool, 2025-05-07T20:33:15.7994612Z compiled: bool, 2025-05-07T20:33:15.7994851Z ) -> None: 2025-05-07T20:33:15.7995088Z torch.manual_seed(2025) 2025-05-07T20:33:15.7995341Z 2025-05-07T20:33:15.7995626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.7996062Z 2025-05-07T20:33:15.7996293Z x_sign = torch.sign(x) 2025-05-07T20:33:15.7996597Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.7996928Z x = x_sign * x_clamp 2025-05-07T20:33:15.7997179Z x0 = x[:, :D] 2025-05-07T20:33:15.7997414Z x1 = x[:, D:] 2025-05-07T20:33:15.7997640Z 2025-05-07T20:33:15.7997832Z if contiguous: 2025-05-07T20:33:15.7998075Z x0 = x0.contiguous() 2025-05-07T20:33:15.7998361Z x1 = x1.contiguous() 2025-05-07T20:33:15.7998618Z 2025-05-07T20:33:15.7998833Z if scale_ub is not None: 2025-05-07T20:33:15.7999125Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.7999476Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.7999797Z ) 2025-05-07T20:33:15.8000011Z else: 2025-05-07T20:33:15.8000237Z scale_ub_tensor = None 2025-05-07T20:33:15.8000500Z 2025-05-07T20:33:15.8000749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8001083Z op = silu_mul_quant 2025-05-07T20:33:15.8001340Z if compiled: 2025-05-07T20:33:15.8001602Z op = torch.compile(op) 2025-05-07T20:33:15.8001915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8002203Z 2025-05-07T20:33:15.8002416Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.8002587Z 2025-05-07T20:33:15.8002699Z moe/activation_test.py:117: 2025-05-07T20:33:15.8003007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8003362Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.8003658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8004361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.8005057Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.8005607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.8006304Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.8006985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.8007525Z kernel = self.compile( 2025-05-07T20:33:15.8008075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.8008872Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.8009285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8009561Z 2025-05-07T20:33:15.8009770Z self = 2025-05-07T20:33:15.8010863Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.8012251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f317b240>} 2025-05-07T20:33:15.8013602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.8014684Z context = 2025-05-07T20:33:15.8014976Z 2025-05-07T20:33:15.8015146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.8015681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.8016203Z module_map=module_map) 2025-05-07T20:33:15.8016580Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.8016946Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.8017214Z E ^ 2025-05-07T20:33:15.8017680Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.8018139Z 2025-05-07T20:33:15.8018556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.8019079Z 2025-05-07T20:33:15.8019189Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.8019620Z self=, 2025-05-07T20:33:15.8020026Z T=4096, 2025-05-07T20:33:15.8020231Z D=7168, 2025-05-07T20:33:15.8020434Z scale_ub=None, 2025-05-07T20:33:15.8020655Z contiguous=False, 2025-05-07T20:33:15.8020889Z compiled=False, 2025-05-07T20:33:15.8021102Z ) 2025-05-07T20:33:15.8021430Z self = 2025-05-07T20:33:15.8021928Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:15.8022216Z 2025-05-07T20:33:15.8022299Z @given( 2025-05-07T20:33:15.8022540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.8022860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.8023179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.8023520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.8023858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.8024160Z ) 2025-05-07T20:33:15.8024519Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.8024971Z def test_silu_mul_quant( 2025-05-07T20:33:15.8025216Z self, 2025-05-07T20:33:15.8025422Z T: int, 2025-05-07T20:33:15.8025629Z D: int, 2025-05-07T20:33:15.8025852Z scale_ub: Optional[float], 2025-05-07T20:33:15.8026163Z contiguous: bool, 2025-05-07T20:33:15.8026437Z compiled: bool, 2025-05-07T20:33:15.8026669Z ) -> None: 2025-05-07T20:33:15.8026897Z torch.manual_seed(2025) 2025-05-07T20:33:15.8027150Z 2025-05-07T20:33:15.8027424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.8027778Z 2025-05-07T20:33:15.8027980Z x_sign = torch.sign(x) 2025-05-07T20:33:15.8028275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.8028689Z x = x_sign * x_clamp 2025-05-07T20:33:15.8028941Z x0 = x[:, :D] 2025-05-07T20:33:15.8029165Z x1 = x[:, D:] 2025-05-07T20:33:15.8029384Z 2025-05-07T20:33:15.8029580Z if contiguous: 2025-05-07T20:33:15.8029857Z x0 = x0.contiguous() 2025-05-07T20:33:15.8030126Z x1 = x1.contiguous() 2025-05-07T20:33:15.8030374Z 2025-05-07T20:33:15.8030574Z if scale_ub is not None: 2025-05-07T20:33:15.8030849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.8031193Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.8031512Z ) 2025-05-07T20:33:15.8031706Z else: 2025-05-07T20:33:15.8031925Z scale_ub_tensor = None 2025-05-07T20:33:15.8032183Z 2025-05-07T20:33:15.8032417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8032745Z op = silu_mul_quant 2025-05-07T20:33:15.8033003Z if compiled: 2025-05-07T20:33:15.8033258Z op = torch.compile(op) 2025-05-07T20:33:15.8033610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8033896Z 2025-05-07T20:33:15.8034093Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.8034270Z 2025-05-07T20:33:15.8034372Z moe/activation_test.py:117: 2025-05-07T20:33:15.8034687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8035025Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.8035315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8036085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.8036832Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.8037376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.8038068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.8038758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.8039292Z kernel = self.compile( 2025-05-07T20:33:15.8039849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.8040514Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.8040925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8041160Z 2025-05-07T20:33:15.8041370Z self = 2025-05-07T20:33:15.8042464Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.8043855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f25c0>} 2025-05-07T20:33:15.8045209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.8046253Z context = 2025-05-07T20:33:15.8046545Z 2025-05-07T20:33:15.8046716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.8047260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.8047741Z module_map=module_map) 2025-05-07T20:33:15.8048106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.8048475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.8048749Z E ^ 2025-05-07T20:33:15.8049318Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.8049778Z 2025-05-07T20:33:15.8050192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.8050753Z 2025-05-07T20:33:15.8050863Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.8051288Z self=, 2025-05-07T20:33:15.8051703Z T=128, 2025-05-07T20:33:15.8051895Z D=7168, 2025-05-07T20:33:15.8052099Z scale_ub=None, 2025-05-07T20:33:15.8052320Z contiguous=False, 2025-05-07T20:33:15.8052546Z compiled=True, 2025-05-07T20:33:15.8052759Z ) 2025-05-07T20:33:15.8608962Z self = 2025-05-07T20:33:15.8609546Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.8609900Z 2025-05-07T20:33:15.8609997Z @given( 2025-05-07T20:33:15.8610325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.8610661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.8610975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.8611316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.8611661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.8611957Z ) 2025-05-07T20:33:15.8612308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.8612761Z def test_silu_mul_quant( 2025-05-07T20:33:15.8613021Z self, 2025-05-07T20:33:15.8613223Z T: int, 2025-05-07T20:33:15.8613439Z D: int, 2025-05-07T20:33:15.8613672Z scale_ub: Optional[float], 2025-05-07T20:33:15.8613955Z contiguous: bool, 2025-05-07T20:33:15.8614207Z compiled: bool, 2025-05-07T20:33:15.8614444Z ) -> None: 2025-05-07T20:33:15.8614672Z torch.manual_seed(2025) 2025-05-07T20:33:15.8614932Z 2025-05-07T20:33:15.8615223Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.8615570Z 2025-05-07T20:33:15.8615775Z x_sign = torch.sign(x) 2025-05-07T20:33:15.8616080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.8616407Z x = x_sign * x_clamp 2025-05-07T20:33:15.8616649Z x0 = x[:, :D] 2025-05-07T20:33:15.8616869Z x1 = x[:, D:] 2025-05-07T20:33:15.8617082Z 2025-05-07T20:33:15.8617273Z if contiguous: 2025-05-07T20:33:15.8617514Z x0 = x0.contiguous() 2025-05-07T20:33:15.8617782Z x1 = x1.contiguous() 2025-05-07T20:33:15.8618032Z 2025-05-07T20:33:15.8618230Z if scale_ub is not None: 2025-05-07T20:33:15.8618503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.8618836Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.8619152Z ) 2025-05-07T20:33:15.8619355Z else: 2025-05-07T20:33:15.8619570Z scale_ub_tensor = None 2025-05-07T20:33:15.8619831Z 2025-05-07T20:33:15.8620068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8620384Z op = silu_mul_quant 2025-05-07T20:33:15.8620637Z if compiled: 2025-05-07T20:33:15.8620889Z op = torch.compile(op) 2025-05-07T20:33:15.8621185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.8621462Z 2025-05-07T20:33:15.8621659Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.8621947Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.8622237Z 2025-05-07T20:33:15.8622480Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.8622820Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.8623112Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.8623430Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.8623922Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.8624235Z 2025-05-07T20:33:15.8624445Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.8624646Z 2025-05-07T20:33:15.8624836Z moe/activation_test.py:126: 2025-05-07T20:33:15.8625137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8625472Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.8625804Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.8626593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.8627341Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.8627889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.8628571Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.8629313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.8630040Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.8630780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.8631421Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.8632033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.8632549Z fn() 2025-05-07T20:33:15.8633058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.8633644Z self.fn.run( 2025-05-07T20:33:15.8634109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.8634648Z kernel = self.compile( 2025-05-07T20:33:15.8635193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.8635906Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.8636356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.8636600Z 2025-05-07T20:33:15.8636811Z self = 2025-05-07T20:33:15.8637896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.8639274Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f31a0>} 2025-05-07T20:33:15.8640617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.8641648Z context = 2025-05-07T20:33:15.8641940Z 2025-05-07T20:33:15.8642109Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.8642633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.8643102Z module_map=module_map) 2025-05-07T20:33:15.8643469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.8643833Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.8644098Z E ^ 2025-05-07T20:33:15.8644566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.8645075Z 2025-05-07T20:33:15.8645529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.8646041Z 2025-05-07T20:33:15.8646215Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.8646655Z self=, 2025-05-07T20:33:15.8647062Z T=128, 2025-05-07T20:33:15.8647254Z D=7168, 2025-05-07T20:33:15.8647444Z scale_ub=None, 2025-05-07T20:33:15.8647662Z contiguous=False, 2025-05-07T20:33:15.8647890Z compiled=False, 2025-05-07T20:33:15.8648097Z ) 2025-05-07T20:33:16.0645837Z self = 2025-05-07T20:33:16.0646369Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:16.0646738Z 2025-05-07T20:33:16.0646858Z @given( 2025-05-07T20:33:16.0647200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.0647570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.0648037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.0648386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.0648733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.0649025Z ) 2025-05-07T20:33:16.0649388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.0649847Z def test_silu_mul_quant( 2025-05-07T20:33:16.0650096Z self, 2025-05-07T20:33:16.0650331Z T: int, 2025-05-07T20:33:16.0650531Z D: int, 2025-05-07T20:33:16.0650758Z scale_ub: Optional[float], 2025-05-07T20:33:16.0651039Z contiguous: bool, 2025-05-07T20:33:16.0651292Z compiled: bool, 2025-05-07T20:33:16.0651524Z ) -> None: 2025-05-07T20:33:16.0651757Z torch.manual_seed(2025) 2025-05-07T20:33:16.0652016Z 2025-05-07T20:33:16.0652293Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.0652662Z 2025-05-07T20:33:16.0652871Z x_sign = torch.sign(x) 2025-05-07T20:33:16.0653166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.0653494Z x = x_sign * x_clamp 2025-05-07T20:33:16.0653753Z x0 = x[:, :D] 2025-05-07T20:33:16.0653975Z x1 = x[:, D:] 2025-05-07T20:33:16.0654199Z 2025-05-07T20:33:16.0654394Z if contiguous: 2025-05-07T20:33:16.0654632Z x0 = x0.contiguous() 2025-05-07T20:33:16.0654902Z x1 = x1.contiguous() 2025-05-07T20:33:16.0655158Z 2025-05-07T20:33:16.0655354Z if scale_ub is not None: 2025-05-07T20:33:16.0655644Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.0655991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.0656304Z ) 2025-05-07T20:33:16.0656519Z else: 2025-05-07T20:33:16.0656738Z scale_ub_tensor = None 2025-05-07T20:33:16.0657002Z 2025-05-07T20:33:16.0657241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.0657578Z op = silu_mul_quant 2025-05-07T20:33:16.0657845Z if compiled: 2025-05-07T20:33:16.0658094Z op = torch.compile(op) 2025-05-07T20:33:16.0658400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.0658682Z 2025-05-07T20:33:16.0658882Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.0659057Z 2025-05-07T20:33:16.0659163Z moe/activation_test.py:117: 2025-05-07T20:33:16.0659467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.0659805Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.0660095Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.0667063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.0667771Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.0668440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.0669184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.0669915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.0670462Z kernel = self.compile( 2025-05-07T20:33:16.0671012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.0671663Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.0672068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.0672302Z 2025-05-07T20:33:16.0672517Z self = 2025-05-07T20:33:16.0673653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.0675035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f1a645e0>} 2025-05-07T20:33:16.0676436Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.0677464Z context = 2025-05-07T20:33:16.0677751Z 2025-05-07T20:33:16.0677933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.0678455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.0678932Z module_map=module_map) 2025-05-07T20:33:16.0679312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.0679678Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.0679936Z E ^ 2025-05-07T20:33:16.0680407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:16.0681290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:16.0681908Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:16.0696623Z > y_fp8, y_scale = fn() fails at moe/activation_test.py:117 -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]: same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:16.0713060Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:16.4504019Z y_fp8, y_scale = fn() returns, then > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid]: same CompilationError
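Every example above fails for the same underlying reason: fp8e4nv is Triton's name for FP8 E4M3 (torch.float8_e4m3fn), and lowering it requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper); older architectures only expose fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal guard along these lines (a sketch only, not part of the test file; _supports_fp8e4nv is an invented name) would make the suite skip rather than error on such GPUs:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3, torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper);
        # earlier GPUs only get fp8e4b15/fp8e5 in Triton, matching the
        # ValueError repeated throughout this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test shown above, e.g.:
    # @unittest.skipUnless(_supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")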
2025-05-07T20:33:16.4525732Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:16.8182550Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
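Note that the reference path, triton_quantize_fp8_row, is itself a Triton kernel (_kernel_quantize_fp8_row), so both the fused op and its reference hit the same compile error. For intuition only, a rough eager-mode sketch of rowwise FP8 quantization follows; it assumes scale_ub caps the per-row max (consistent with how the test passes scale_ub_tensor) and uses the dequantization identity the test itself applies, y = y_fp8.to(torch.float32) * y_scale[:, None]. The exact clamping and epsilon details of fbgemm's kernel may differ, and quantize_fp8_row_eager is a hypothetical name:

    import torch

    def quantize_fp8_row_eager(y, scale_ub=None):
        # Rowwise FP8 quantization such that y ~= y_fp8.to(torch.float32) * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max      # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)      # assumed: scale_ub caps the row max
        scale = torch.clamp(row_max, min=1e-12) / fp8_max   # per-row dequantization scale
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The cast to torch.float8_e4m3fn is a plain dtype conversion in PyTorch, so this eager sketch runs even on GPUs where Triton refuses to emit fp8e4nv.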
2025-05-07T20:33:16.8204093Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:17.2453281Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
2025-05-07T20:33:17.2480435Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:17.6752012Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
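The failure does not depend on hypothesis or the test harness at all; a minimal standalone reproduction (a sketch, with the import path inferred from the traceback and the shape taken from one of the tried examples):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On a GPU without FP8 E4M3 support this raises the same CompilationError:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)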
2025-05-07T20:33:17.6774165Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:17.7032333Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:17.7033669Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:17.7035106Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:17.7036179Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:17.7037285Z W0507 20:33:17.701000 96512 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:17.7940176Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
2025-05-07T20:33:17.7961979Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:17.9397173Z > y_fp8, y_scale = fn() fails at moe/activation_test.py:117, via torch/_dynamo/eval_frame.py:678 (_fn) -> _fbgemm_silu_mul_quant[grid]: same CompilationError
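The recompile_limit warning above is a separate issue from the FP8 error: the sweep varies T (the size of dim 0) and flips x0/x1 between contiguous copies and strided views of x (stride 5120 vs 10240, per the guard message), so torch.compile re-traces silu_mul_quant until it hits the default limit of 8 and then falls back to eager. Two common remedies for shape-sweeping tests, sketched with the import path inferred from the traceback:

    import torch
    import torch._dynamo
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # Option 1: declare dim 0 (T) dynamic before the first compiled call, so new
    # values of T reuse the same graph. The contiguous/strided switch may still
    # cost one extra recompile, since it changes the stride guard cited above.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, None)  # still raises the FP8 CompilationError on this GPU

    # Option 2: raise the knob the warning names (default 8) for parameter sweeps.
    torch._dynamo.config.recompile_limit = 64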
2025-05-07T20:33:17.9414488Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:18.1579169Z > y_fp8_ref, y_scale_ref = ref_fn() fails at moe/activation_test.py:126 -> _kernel_quantize_fp8_row[grid]: same CompilationError
2025-05-07T20:33:18.1601040Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.3125056Z > y_fp8, y_scale = fn() fails at moe/activation_test.py:117 -> _fbgemm_silu_mul_quant[grid]:
2025-05-07T20:33:18.3138893Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:18.3139245Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:18.3139502Z E   ^
2025-05-07T20:33:18.3139963Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.3140411Z 2025-05-07T20:33:18.3140822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.3141337Z 2025-05-07T20:33:18.3141441Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.3141851Z self=, 2025-05-07T20:33:18.3142251Z T=128, 2025-05-07T20:33:18.3142442Z D=5120, 2025-05-07T20:33:18.3142643Z scale_ub=None, 2025-05-07T20:33:18.3142862Z contiguous=False, 2025-05-07T20:33:18.3143081Z compiled=True, 2025-05-07T20:33:18.3143287Z ) 2025-05-07T20:33:18.3143609Z self = 2025-05-07T20:33:18.3144096Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:18.3144373Z 2025-05-07T20:33:18.3144454Z @given( 2025-05-07T20:33:18.3144687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.3144999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.3145303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.3145629Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.3145957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.3146236Z ) 2025-05-07T20:33:18.3146586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.3147037Z def test_silu_mul_quant( 2025-05-07T20:33:18.3147280Z self, 2025-05-07T20:33:18.3147484Z T: int, 2025-05-07T20:33:18.3147732Z D: int, 2025-05-07T20:33:18.3147973Z scale_ub: Optional[float], 2025-05-07T20:33:18.3148246Z contiguous: bool, 2025-05-07T20:33:18.3148490Z compiled: bool, 2025-05-07T20:33:18.3148714Z ) -> None: 2025-05-07T20:33:18.3148924Z torch.manual_seed(2025) 2025-05-07T20:33:18.3149165Z 2025-05-07T20:33:18.3149439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.3149773Z 2025-05-07T20:33:18.3149966Z x_sign = torch.sign(x) 2025-05-07T20:33:18.3150257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.3150566Z x = x_sign * x_clamp 2025-05-07T20:33:18.3150800Z x0 = x[:, :D] 2025-05-07T20:33:18.3151017Z x1 = x[:, D:] 2025-05-07T20:33:18.3151229Z 2025-05-07T20:33:18.3151410Z if contiguous: 2025-05-07T20:33:18.3151741Z x0 = x0.contiguous() 2025-05-07T20:33:18.3152002Z x1 = x1.contiguous() 2025-05-07T20:33:18.3152241Z 2025-05-07T20:33:18.3152442Z if scale_ub is not None: 2025-05-07T20:33:18.3152761Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.3153091Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.3153402Z ) 2025-05-07T20:33:18.3153598Z else: 2025-05-07T20:33:18.3153803Z scale_ub_tensor = None 2025-05-07T20:33:18.3154050Z 2025-05-07T20:33:18.3154288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.3154598Z op = silu_mul_quant 2025-05-07T20:33:18.3154849Z if compiled: 2025-05-07T20:33:18.3155094Z op = torch.compile(op) 2025-05-07T20:33:18.3155395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.3155665Z 2025-05-07T20:33:18.3155909Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.3156078Z 2025-05-07T20:33:18.3156180Z moe/activation_test.py:117: 2025-05-07T20:33:18.3156515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.3156852Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.3157138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.3157685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.3158246Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.3158905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.3159588Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.3160127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.3160810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.3161479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.3162009Z kernel = self.compile( 2025-05-07T20:33:18.3162549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.3163204Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.3163598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.3163827Z 2025-05-07T20:33:18.3164034Z self = 2025-05-07T20:33:18.3165112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.3166746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f01a3920>} 2025-05-07T20:33:18.3168082Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.3169098Z context = 2025-05-07T20:33:18.3169393Z 2025-05-07T20:33:18.3169562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.3170083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.3170547Z module_map=module_map) 2025-05-07T20:33:18.3170903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.3171256Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.3171512Z E ^ 2025-05-07T20:33:18.3172052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.3172595Z 2025-05-07T20:33:18.3173006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.3173579Z 2025-05-07T20:33:18.3173684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.3174093Z self=, 2025-05-07T20:33:18.3174485Z T=128, 2025-05-07T20:33:18.3174677Z D=7168, 2025-05-07T20:33:18.3174869Z scale_ub=1200.0, 2025-05-07T20:33:18.3175085Z contiguous=False, 2025-05-07T20:33:18.3175308Z compiled=False, 2025-05-07T20:33:18.3175514Z ) 2025-05-07T20:33:18.4315049Z self = 2025-05-07T20:33:18.4315891Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:18.4316270Z 2025-05-07T20:33:18.4316388Z @given( 2025-05-07T20:33:18.4316709Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.4317243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.4317877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.4318531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.4319184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.4319741Z ) 2025-05-07T20:33:18.4320428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.4321316Z def test_silu_mul_quant( 2025-05-07T20:33:18.4321792Z self, 2025-05-07T20:33:18.4322169Z T: int, 2025-05-07T20:33:18.4322551Z D: int, 2025-05-07T20:33:18.4322979Z scale_ub: Optional[float], 2025-05-07T20:33:18.4323500Z contiguous: bool, 2025-05-07T20:33:18.4323975Z compiled: bool, 2025-05-07T20:33:18.4324420Z ) -> None: 2025-05-07T20:33:18.4325168Z torch.manual_seed(2025) 2025-05-07T20:33:18.4325657Z 2025-05-07T20:33:18.4326196Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.4326870Z 2025-05-07T20:33:18.4327252Z x_sign = torch.sign(x) 2025-05-07T20:33:18.4327595Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.4327924Z x = x_sign * x_clamp 2025-05-07T20:33:18.4328162Z x0 = x[:, :D] 2025-05-07T20:33:18.4328382Z x1 = x[:, D:] 2025-05-07T20:33:18.4328584Z 2025-05-07T20:33:18.4328770Z if contiguous: 2025-05-07T20:33:18.4329001Z x0 = x0.contiguous() 2025-05-07T20:33:18.4329254Z x1 = x1.contiguous() 2025-05-07T20:33:18.4329496Z 2025-05-07T20:33:18.4329690Z if scale_ub is not None: 2025-05-07T20:33:18.4329962Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.4330287Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.4330594Z ) 2025-05-07T20:33:18.4330790Z else: 2025-05-07T20:33:18.4331000Z scale_ub_tensor = None 2025-05-07T20:33:18.4331258Z 2025-05-07T20:33:18.4331490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.4331796Z op = silu_mul_quant 2025-05-07T20:33:18.4332056Z if compiled: 2025-05-07T20:33:18.4332304Z op = torch.compile(op) 2025-05-07T20:33:18.4332592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4332864Z 2025-05-07T20:33:18.4333058Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.4333220Z 2025-05-07T20:33:18.4333318Z moe/activation_test.py:117: 2025-05-07T20:33:18.4333611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4333944Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.4334227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4334984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.4335727Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.4336264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.4336995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.4337649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.4338175Z kernel = self.compile( 2025-05-07T20:33:18.4338711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.4339361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.4339757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4339984Z 2025-05-07T20:33:18.4340192Z self = 2025-05-07T20:33:18.4341319Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.4342684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f16c0>} 2025-05-07T20:33:18.4344019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.4345041Z context = 2025-05-07T20:33:18.4345327Z 2025-05-07T20:33:18.4345498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.4346012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.4346484Z module_map=module_map) 2025-05-07T20:33:18.4346849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.4347208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.4347485Z E ^ 2025-05-07T20:33:18.4347985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.4348435Z 2025-05-07T20:33:18.4348852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.4349361Z 2025-05-07T20:33:18.4349472Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.4349882Z self=, 2025-05-07T20:33:18.4350288Z T=128, 2025-05-07T20:33:18.4350480Z D=5120, 2025-05-07T20:33:18.4350673Z scale_ub=None, 2025-05-07T20:33:18.4350893Z contiguous=False, 2025-05-07T20:33:18.4351125Z compiled=False, 2025-05-07T20:33:18.4351325Z ) 2025-05-07T20:33:18.4351645Z self = 2025-05-07T20:33:18.4352136Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.4352407Z 2025-05-07T20:33:18.4352485Z @given( 2025-05-07T20:33:18.4352716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.4353038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.4353349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.4353673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.4354001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.4354285Z ) 2025-05-07T20:33:18.4354626Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.4355071Z def test_silu_mul_quant( 2025-05-07T20:33:18.4355312Z self, 2025-05-07T20:33:18.4355597Z T: int, 2025-05-07T20:33:18.4355846Z D: int, 2025-05-07T20:33:18.4356072Z scale_ub: Optional[float], 2025-05-07T20:33:18.4356339Z contiguous: bool, 2025-05-07T20:33:18.4356577Z compiled: bool, 2025-05-07T20:33:18.4356848Z ) -> None: 2025-05-07T20:33:18.4357060Z torch.manual_seed(2025) 2025-05-07T20:33:18.4357307Z 2025-05-07T20:33:18.4357577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.4357915Z 2025-05-07T20:33:18.4358102Z x_sign = torch.sign(x) 2025-05-07T20:33:18.4358389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.4358704Z x = x_sign * x_clamp 2025-05-07T20:33:18.4358942Z x0 = x[:, :D] 2025-05-07T20:33:18.4359152Z x1 = x[:, D:] 2025-05-07T20:33:18.4359360Z 2025-05-07T20:33:18.4359547Z if contiguous: 2025-05-07T20:33:18.4359777Z x0 = x0.contiguous() 2025-05-07T20:33:18.4360038Z x1 = x1.contiguous() 2025-05-07T20:33:18.4360279Z 2025-05-07T20:33:18.4360491Z if scale_ub is not None: 2025-05-07T20:33:18.4360811Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.4361151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.4361458Z ) 2025-05-07T20:33:18.4361655Z else: 2025-05-07T20:33:18.4361860Z scale_ub_tensor = None 2025-05-07T20:33:18.4362111Z 2025-05-07T20:33:18.4362346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.4362657Z op = silu_mul_quant 2025-05-07T20:33:18.4362903Z if compiled: 2025-05-07T20:33:18.4363147Z op = torch.compile(op) 2025-05-07T20:33:18.4363445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4363715Z 2025-05-07T20:33:18.4363907Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.4364068Z 2025-05-07T20:33:18.4364167Z moe/activation_test.py:117: 2025-05-07T20:33:18.4364461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4364793Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.4365073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.4366059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.4366878Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.4367499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.4368308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.4369090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.4369705Z kernel = self.compile( 2025-05-07T20:33:18.4370330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.4371108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.4371557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.4371826Z 2025-05-07T20:33:18.4372058Z self = 2025-05-07T20:33:18.4373371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.4375067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cfd3fba0>} 2025-05-07T20:33:18.4376799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.4377871Z context = 2025-05-07T20:33:18.4378161Z 2025-05-07T20:33:18.4378326Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.4378912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.4379383Z module_map=module_map) 2025-05-07T20:33:18.4379741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.4380092Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.4380348Z E ^ 2025-05-07T20:33:18.4380801Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.4381256Z 2025-05-07T20:33:18.4381666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.4382176Z 2025-05-07T20:33:18.4382285Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.4382761Z self=, 2025-05-07T20:33:18.4383166Z T=128, 2025-05-07T20:33:18.4383358Z D=5120, 2025-05-07T20:33:18.4383550Z scale_ub=1200.0, 2025-05-07T20:33:18.4383771Z contiguous=True, 2025-05-07T20:33:18.4383989Z compiled=False, 2025-05-07T20:33:18.4384326Z ) 2025-05-07T20:33:18.6111289Z self = 2025-05-07T20:33:18.6112005Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.6112403Z 2025-05-07T20:33:18.6112515Z @given( 2025-05-07T20:33:18.6112838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6113277Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6113610Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6113948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6114293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6114579Z ) 2025-05-07T20:33:18.6114943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6115403Z def test_silu_mul_quant( 2025-05-07T20:33:18.6115648Z self, 2025-05-07T20:33:18.6115936Z T: int, 2025-05-07T20:33:18.6116146Z D: int, 2025-05-07T20:33:18.6116369Z scale_ub: Optional[float], 2025-05-07T20:33:18.6116638Z contiguous: bool, 2025-05-07T20:33:18.6116879Z compiled: bool, 2025-05-07T20:33:18.6117113Z ) -> None: 2025-05-07T20:33:18.6117328Z torch.manual_seed(2025) 2025-05-07T20:33:18.6117571Z 2025-05-07T20:33:18.6117848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6118194Z 2025-05-07T20:33:18.6118389Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6118682Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6119003Z x = x_sign * x_clamp 2025-05-07T20:33:18.6119252Z x0 = x[:, :D] 2025-05-07T20:33:18.6119474Z x1 = x[:, D:] 2025-05-07T20:33:18.6119680Z 2025-05-07T20:33:18.6119865Z if contiguous: 2025-05-07T20:33:18.6120106Z x0 = x0.contiguous() 2025-05-07T20:33:18.6120359Z x1 = x1.contiguous() 2025-05-07T20:33:18.6120603Z 2025-05-07T20:33:18.6120798Z if scale_ub is not None: 2025-05-07T20:33:18.6121067Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.6121400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.6121716Z ) 2025-05-07T20:33:18.6121912Z else: 2025-05-07T20:33:18.6122123Z scale_ub_tensor = None 2025-05-07T20:33:18.6122377Z 2025-05-07T20:33:18.6122614Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.6122927Z op = silu_mul_quant 2025-05-07T20:33:18.6123174Z if compiled: 2025-05-07T20:33:18.6123539Z op = torch.compile(op) 2025-05-07T20:33:18.6123898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6124172Z 2025-05-07T20:33:18.6124368Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.6124593Z 2025-05-07T20:33:18.6124691Z moe/activation_test.py:117: 2025-05-07T20:33:18.6124997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6125328Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.6125614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6126299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.6126987Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.6127529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.6128253Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.6128985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.6129521Z kernel = self.compile( 2025-05-07T20:33:18.6130063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.6130711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.6131117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6131348Z 2025-05-07T20:33:18.6131560Z self = 2025-05-07T20:33:18.6132640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.6134020Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078cb80>} 2025-05-07T20:33:18.6135365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.6136399Z context = 2025-05-07T20:33:18.6136688Z 2025-05-07T20:33:18.6136862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.6137380Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.6137850Z module_map=module_map) 2025-05-07T20:33:18.6138221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.6138577Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.6138834Z E ^ 2025-05-07T20:33:18.6139308Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.6139761Z 2025-05-07T20:33:18.6140184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.6140697Z 2025-05-07T20:33:18.6140800Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6141216Z self=, 2025-05-07T20:33:18.6141621Z T=1, 2025-05-07T20:33:18.6141809Z D=7168, 2025-05-07T20:33:18.6142000Z scale_ub=1200.0, 2025-05-07T20:33:18.6142225Z contiguous=True, 2025-05-07T20:33:18.6142448Z compiled=True, 2025-05-07T20:33:18.6142652Z ) 2025-05-07T20:33:18.6142973Z self = 2025-05-07T20:33:18.6143467Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.6143725Z 2025-05-07T20:33:18.6143902Z @given( 2025-05-07T20:33:18.6144136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6144456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6144763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6145142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6145476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6145762Z ) 2025-05-07T20:33:18.6146103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6146547Z def test_silu_mul_quant( 2025-05-07T20:33:18.6146788Z self, 2025-05-07T20:33:18.6146988Z T: int, 2025-05-07T20:33:18.6147197Z D: int, 2025-05-07T20:33:18.6147419Z scale_ub: Optional[float], 2025-05-07T20:33:18.6147717Z contiguous: bool, 2025-05-07T20:33:18.6147980Z compiled: bool, 2025-05-07T20:33:18.6148207Z ) -> None: 2025-05-07T20:33:18.6148417Z torch.manual_seed(2025) 2025-05-07T20:33:18.6148673Z 2025-05-07T20:33:18.6149023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6149367Z 2025-05-07T20:33:18.6149561Z x_sign = torch.sign(x) 2025-05-07T20:33:18.6149852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.6150166Z x = x_sign * x_clamp 2025-05-07T20:33:18.6150405Z x0 = x[:, :D] 2025-05-07T20:33:18.6150619Z x1 = x[:, D:] 2025-05-07T20:33:18.6150835Z 2025-05-07T20:33:18.6151022Z if contiguous: 2025-05-07T20:33:18.6151248Z x0 = x0.contiguous() 2025-05-07T20:33:18.6151510Z x1 = x1.contiguous() 2025-05-07T20:33:18.6151748Z 2025-05-07T20:33:18.6151933Z if scale_ub is not None: 2025-05-07T20:33:18.6152203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.6152539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.6152844Z ) 2025-05-07T20:33:18.6153033Z else: 2025-05-07T20:33:18.6153252Z scale_ub_tensor = None 2025-05-07T20:33:18.6153497Z 2025-05-07T20:33:18.6153748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.6154069Z op = silu_mul_quant 2025-05-07T20:33:18.6154315Z if compiled: 2025-05-07T20:33:18.6154559Z op = torch.compile(op) 2025-05-07T20:33:18.6154854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6155128Z 2025-05-07T20:33:18.6155311Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.6155477Z 2025-05-07T20:33:18.6155579Z moe/activation_test.py:117: 2025-05-07T20:33:18.6155931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6156261Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.6156534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.6157088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.6157650Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.6158303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.6158994Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.6159531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.6160213Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.6160878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.6161413Z kernel = self.compile( 2025-05-07T20:33:18.6161960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.6162609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.6163060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.6163334Z 2025-05-07T20:33:18.6163546Z self = 2025-05-07T20:33:18.6164678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.6166405Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078e2a0>} 2025-05-07T20:33:18.6168057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.6169306Z context = 2025-05-07T20:33:18.6169655Z 2025-05-07T20:33:18.6169926Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.6170447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.6170910Z module_map=module_map) 2025-05-07T20:33:18.6171273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.6171628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.6171886Z E ^ 2025-05-07T20:33:18.6172353Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.6172806Z 2025-05-07T20:33:18.6173219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.6173728Z 2025-05-07T20:33:18.6173838Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6174244Z self=, 2025-05-07T20:33:18.6174650Z T=1, 2025-05-07T20:33:18.6174834Z D=7168, 2025-05-07T20:33:18.6175022Z scale_ub=1200.0, 2025-05-07T20:33:18.6175245Z contiguous=False, 2025-05-07T20:33:18.6175467Z compiled=True, 2025-05-07T20:33:18.6175664Z ) 2025-05-07T20:33:18.7493156Z self = 2025-05-07T20:33:18.7493905Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:18.7494268Z 2025-05-07T20:33:18.7494380Z @given( 2025-05-07T20:33:18.7494690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7495096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7495493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7495896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7496221Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7496505Z ) 2025-05-07T20:33:18.7496860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7497301Z def test_silu_mul_quant( 2025-05-07T20:33:18.7497551Z self, 2025-05-07T20:33:18.7497752Z T: int, 2025-05-07T20:33:18.7497953Z D: int, 2025-05-07T20:33:18.7498179Z scale_ub: Optional[float], 2025-05-07T20:33:18.7498454Z contiguous: bool, 2025-05-07T20:33:18.7498693Z compiled: bool, 2025-05-07T20:33:18.7498923Z ) -> None: 2025-05-07T20:33:18.7499145Z torch.manual_seed(2025) 2025-05-07T20:33:18.7499383Z 2025-05-07T20:33:18.7499659Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7500009Z 2025-05-07T20:33:18.7500204Z x_sign = torch.sign(x) 2025-05-07T20:33:18.7500493Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.7500808Z x = x_sign * x_clamp 2025-05-07T20:33:18.7501053Z x0 = x[:, :D] 2025-05-07T20:33:18.7501267Z x1 = x[:, D:] 2025-05-07T20:33:18.7501655Z 2025-05-07T20:33:18.7501844Z if contiguous: 2025-05-07T20:33:18.7502074Z x0 = x0.contiguous() 2025-05-07T20:33:18.7502331Z x1 = x1.contiguous() 2025-05-07T20:33:18.7502633Z 2025-05-07T20:33:18.7502826Z if scale_ub is not None: 2025-05-07T20:33:18.7503098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.7503440Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.7503748Z ) 2025-05-07T20:33:18.7503947Z else: 2025-05-07T20:33:18.7504161Z scale_ub_tensor = None 2025-05-07T20:33:18.7504410Z 2025-05-07T20:33:18.7504646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.7504962Z op = silu_mul_quant 2025-05-07T20:33:18.7505221Z if compiled: 2025-05-07T20:33:18.7505464Z op = torch.compile(op) 2025-05-07T20:33:18.7505760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.7506043Z 2025-05-07T20:33:18.7506236Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.7506405Z 2025-05-07T20:33:18.7506571Z moe/activation_test.py:117: 2025-05-07T20:33:18.7506871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.7507206Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.7507485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.7508042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.7508601Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.7509251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.7509939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.7510476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.7511152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.7511817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.7512351Z kernel = self.compile( 2025-05-07T20:33:18.7512897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.7513550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.7513952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.7514181Z 2025-05-07T20:33:18.7514392Z self = 2025-05-07T20:33:18.7515474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.7516947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078f9c0>} 2025-05-07T20:33:18.7518289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.7519314Z context = 2025-05-07T20:33:18.7519603Z 2025-05-07T20:33:18.7519772Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.7520291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.7520763Z module_map=module_map) 2025-05-07T20:33:18.7521128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.7521486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.7521835Z E ^ 2025-05-07T20:33:18.7522309Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.7522760Z 2025-05-07T20:33:18.7523221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.7523731Z 2025-05-07T20:33:18.7523838Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7524260Z self=, 2025-05-07T20:33:18.7524664Z T=1, 2025-05-07T20:33:18.7524849Z D=7168, 2025-05-07T20:33:18.7525040Z scale_ub=None, 2025-05-07T20:33:18.7525257Z contiguous=False, 2025-05-07T20:33:18.7525486Z compiled=True, 2025-05-07T20:33:18.7525684Z ) 2025-05-07T20:33:18.8391496Z self = 2025-05-07T20:33:18.8392206Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:18.8392621Z 2025-05-07T20:33:18.8392749Z @given( 2025-05-07T20:33:18.8393216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.8393658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.8394090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.8394467Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.8394793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.8395078Z ) 2025-05-07T20:33:18.8395437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.8395946Z def test_silu_mul_quant( 2025-05-07T20:33:18.8396192Z self, 2025-05-07T20:33:18.8396389Z T: int, 2025-05-07T20:33:18.8396581Z D: int, 2025-05-07T20:33:18.8396799Z scale_ub: Optional[float], 2025-05-07T20:33:18.8397070Z contiguous: bool, 2025-05-07T20:33:18.8397310Z compiled: bool, 2025-05-07T20:33:18.8397537Z ) -> None: 2025-05-07T20:33:18.8397765Z torch.manual_seed(2025) 2025-05-07T20:33:18.8398047Z 2025-05-07T20:33:18.8398329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.8398674Z 2025-05-07T20:33:18.8398864Z x_sign = torch.sign(x) 2025-05-07T20:33:18.8399158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.8399472Z x = x_sign * x_clamp 2025-05-07T20:33:18.8399717Z x0 = x[:, :D] 2025-05-07T20:33:18.8399930Z x1 = x[:, D:] 2025-05-07T20:33:18.8400139Z 2025-05-07T20:33:18.8400327Z if contiguous: 2025-05-07T20:33:18.8400555Z x0 = x0.contiguous() 2025-05-07T20:33:18.8400817Z x1 = x1.contiguous() 2025-05-07T20:33:18.8401058Z 2025-05-07T20:33:18.8401253Z if scale_ub is not None: 2025-05-07T20:33:18.8401527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.8401867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.8402184Z ) 2025-05-07T20:33:18.8402379Z else: 2025-05-07T20:33:18.8402594Z scale_ub_tensor = None 2025-05-07T20:33:18.8402850Z 2025-05-07T20:33:18.8403090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.8403407Z op = silu_mul_quant 2025-05-07T20:33:18.8403654Z if compiled: 2025-05-07T20:33:18.8403905Z op = torch.compile(op) 2025-05-07T20:33:18.8404199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.8404477Z 2025-05-07T20:33:18.8404667Z y_fp8, y_scale = fn() 2025-05-07T20:33:18.8404954Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:18.8405251Z 2025-05-07T20:33:18.8405486Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.8405822Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:18.8406116Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:18.8406518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:18.8406938Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.8407252Z 2025-05-07T20:33:18.8407454Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:18.8407769Z 2025-05-07T20:33:18.8407894Z moe/activation_test.py:126: 2025-05-07T20:33:18.8408192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.8408531Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:18.8408855Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:18.8409643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:18.8410392Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:18.8410928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.8411612Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.8412342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:18.8413065Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:18.8413794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:18.8414430Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:18.8415037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:18.8415558Z fn() 2025-05-07T20:33:18.8416061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:18.8416646Z self.fn.run( 2025-05-07T20:33:18.8417123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.8417671Z kernel = self.compile( 2025-05-07T20:33:18.8418213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.8418869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.8419270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.8419502Z 2025-05-07T20:33:18.8419711Z self = 2025-05-07T20:33:18.8420793Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.8422175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5cb80>} 2025-05-07T20:33:18.8423519Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.8424543Z context = 2025-05-07T20:33:18.8424832Z 2025-05-07T20:33:18.8424999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.8425520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.8425994Z module_map=module_map) 2025-05-07T20:33:18.8426356Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.8426715Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:18.8426983Z E ^ 2025-05-07T20:33:18.8427509Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.8428045Z 2025-05-07T20:33:18.8428463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.8429021Z 2025-05-07T20:33:18.8429126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.8429537Z self=, 2025-05-07T20:33:18.8429941Z T=1, 2025-05-07T20:33:18.8430126Z D=5120, 2025-05-07T20:33:18.8430328Z scale_ub=1200.0, 2025-05-07T20:33:18.8430553Z contiguous=False, 2025-05-07T20:33:18.8430774Z compiled=True, 2025-05-07T20:33:18.8430983Z ) 2025-05-07T20:33:18.9989365Z self = 2025-05-07T20:33:18.9990069Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:18.9990440Z 2025-05-07T20:33:18.9990551Z @given( 2025-05-07T20:33:18.9990863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9991253Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9991674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9992012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9992348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9992634Z ) 2025-05-07T20:33:18.9992978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9993421Z def test_silu_mul_quant( 2025-05-07T20:33:18.9993669Z self, 2025-05-07T20:33:18.9993861Z T: int, 2025-05-07T20:33:18.9994064Z D: int, 2025-05-07T20:33:18.9994282Z scale_ub: Optional[float], 2025-05-07T20:33:18.9994549Z contiguous: bool, 2025-05-07T20:33:18.9994796Z compiled: bool, 2025-05-07T20:33:18.9995019Z ) -> None: 2025-05-07T20:33:18.9995233Z torch.manual_seed(2025) 2025-05-07T20:33:18.9995483Z 2025-05-07T20:33:18.9995850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9996195Z 2025-05-07T20:33:19.0002620Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0003024Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0003357Z x = x_sign * x_clamp 2025-05-07T20:33:19.0003599Z x0 = x[:, :D] 2025-05-07T20:33:19.0003829Z x1 = x[:, D:] 2025-05-07T20:33:19.0004041Z 2025-05-07T20:33:19.0004230Z if contiguous: 2025-05-07T20:33:19.0004463Z x0 = x0.contiguous() 2025-05-07T20:33:19.0004726Z x1 = x1.contiguous() 2025-05-07T20:33:19.0004969Z 2025-05-07T20:33:19.0005167Z if scale_ub is not None: 2025-05-07T20:33:19.0005445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0005776Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0006093Z ) 2025-05-07T20:33:19.0006297Z else: 2025-05-07T20:33:19.0006512Z scale_ub_tensor = None 2025-05-07T20:33:19.0006765Z 2025-05-07T20:33:19.0007007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0007332Z op = silu_mul_quant 2025-05-07T20:33:19.0007582Z if compiled: 2025-05-07T20:33:19.0007834Z op = torch.compile(op) 2025-05-07T20:33:19.0008138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0008415Z 2025-05-07T20:33:19.0008611Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0008775Z 2025-05-07T20:33:19.0008882Z moe/activation_test.py:117: 2025-05-07T20:33:19.0009176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0009510Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0009795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0010355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.0010918Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.0011702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0012456Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.0012988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.0013736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.0014410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.0014941Z kernel = self.compile( 2025-05-07T20:33:19.0015481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.0016137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.0016537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0016771Z 2025-05-07T20:33:19.0016996Z self = 2025-05-07T20:33:19.0018122Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.0019503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5de40>} 2025-05-07T20:33:19.0020850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.0021881Z context = 2025-05-07T20:33:19.0022173Z 2025-05-07T20:33:19.0022346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.0022880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.0023361Z module_map=module_map) 2025-05-07T20:33:19.0023737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.0024096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.0024360Z E ^ 2025-05-07T20:33:19.0024831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.0025282Z 2025-05-07T20:33:19.0025697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.0026218Z 2025-05-07T20:33:19.0026326Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.0026747Z self=, 2025-05-07T20:33:19.0027161Z T=1, 2025-05-07T20:33:19.0027345Z D=5120, 2025-05-07T20:33:19.0027553Z scale_ub=1200.0, 2025-05-07T20:33:19.0027784Z contiguous=False, 2025-05-07T20:33:19.0028009Z compiled=False, 2025-05-07T20:33:19.0028223Z ) 2025-05-07T20:33:19.0028552Z self = 2025-05-07T20:33:19.0029045Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.0029317Z 2025-05-07T20:33:19.0029399Z @given( 2025-05-07T20:33:19.0029635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.0029951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.0030261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.0030590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.0030923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.0031205Z ) 2025-05-07T20:33:19.0031561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.0032101Z def test_silu_mul_quant( 2025-05-07T20:33:19.0032348Z self, 2025-05-07T20:33:19.0032549Z T: int, 2025-05-07T20:33:19.0032749Z D: int, 2025-05-07T20:33:19.0032965Z scale_ub: Optional[float], 2025-05-07T20:33:19.0033282Z contiguous: bool, 2025-05-07T20:33:19.0033547Z compiled: bool, 2025-05-07T20:33:19.0033777Z ) -> None: 2025-05-07T20:33:19.0033996Z torch.manual_seed(2025) 2025-05-07T20:33:19.0034245Z 2025-05-07T20:33:19.0034520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.0034860Z 2025-05-07T20:33:19.0035056Z x_sign = torch.sign(x) 2025-05-07T20:33:19.0035346Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.0035667Z x = x_sign * x_clamp 2025-05-07T20:33:19.0035967Z x0 = x[:, :D] 2025-05-07T20:33:19.0036193Z x1 = x[:, D:] 2025-05-07T20:33:19.0036413Z 2025-05-07T20:33:19.0036598Z if contiguous: 2025-05-07T20:33:19.0036844Z x0 = x0.contiguous() 2025-05-07T20:33:19.0037156Z x1 = x1.contiguous() 2025-05-07T20:33:19.0037398Z 2025-05-07T20:33:19.0037595Z if scale_ub is not None: 2025-05-07T20:33:19.0037900Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.0038260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.0038576Z ) 2025-05-07T20:33:19.0038774Z else: 2025-05-07T20:33:19.0038988Z scale_ub_tensor = None 2025-05-07T20:33:19.0039240Z 2025-05-07T20:33:19.0039471Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.0039793Z op = silu_mul_quant 2025-05-07T20:33:19.0040052Z if compiled: 2025-05-07T20:33:19.0040305Z op = torch.compile(op) 2025-05-07T20:33:19.0040607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0040884Z 2025-05-07T20:33:19.0041086Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.0041253Z 2025-05-07T20:33:19.0041367Z moe/activation_test.py:117: 2025-05-07T20:33:19.0041657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.0041994Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.0042281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.0042961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.0043647Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <stripped>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <stripped>, 'min_dot_size': <stripped>}
module_map = {'triton.language.extra.libdevice': <stripped>}
context = <stripped>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<stripped>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

self = <stripped>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining frames and locals identical to the traceback above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
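The root cause is architectural, not data-dependent: fp8e4nv is Triton's E4M3 FP8 type, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) upward. This job runs on a g5.4xlarge (A10G, SM 8.6), where Triton exposes only fp8e4b15 and fp8e5, so every compile of _fbgemm_silu_mul_quant (the op evidently fuses SiLU(x0) * x1 with quantization to FP8, returning y_fp8 and its scale) fails before any kernel launches. A capability gate of roughly the following shape would skip these cases instead of failing them; this is a sketch with a hypothetical helper, not code present in moe/activation_test.py:

    # Sketch only: skip FP8-E4M3 tests on pre-SM-8.9 GPUs (hypothetical helper).
    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) needs SM 8.9+; the A10G on this runner
        # reports capability (8, 6), so kernel compilation fails there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(not _supports_fp8e4nv(), reason="fp8e4nv requires SM 8.9+")
    def test_silu_mul_quant() -> None:
        ...  # the Hypothesis test body shown above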
Hypothesis went on to try eleven more examples. Each failed at the same point, src.make_ir in triton/compiler/compiler.py:273, with the identical CompilationError raised at compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the drawn parameters differ; examples with compiled=True additionally pass through torch/_dynamo/eval_frame.py:678:

Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.0880056Z 2025-05-07T20:33:20.0880469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.0880984Z 2025-05-07T20:33:20.0881087Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.0881500Z self=, 2025-05-07T20:33:20.0881898Z T=2048, 2025-05-07T20:33:20.0882092Z D=7168, 2025-05-07T20:33:20.0882285Z scale_ub=None, 2025-05-07T20:33:20.0882501Z contiguous=False, 2025-05-07T20:33:20.0882733Z compiled=True, 2025-05-07T20:33:20.0882937Z ) 2025-05-07T20:33:20.1751459Z self = 2025-05-07T20:33:20.1752014Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.1752298Z 2025-05-07T20:33:20.1752389Z @given( 2025-05-07T20:33:20.1752633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.1752952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.1753249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.1753582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.1753914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.1754195Z ) 2025-05-07T20:33:20.1754545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.1754989Z def test_silu_mul_quant( 2025-05-07T20:33:20.1755229Z self, 2025-05-07T20:33:20.1755434Z T: int, 2025-05-07T20:33:20.1755639Z D: int, 2025-05-07T20:33:20.1755940Z scale_ub: Optional[float], 2025-05-07T20:33:20.1756211Z contiguous: bool, 2025-05-07T20:33:20.1756446Z compiled: bool, 2025-05-07T20:33:20.1756670Z ) -> None: 2025-05-07T20:33:20.1756886Z torch.manual_seed(2025) 2025-05-07T20:33:20.1757306Z 2025-05-07T20:33:20.1757589Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.1757931Z 2025-05-07T20:33:20.1758127Z x_sign = torch.sign(x) 2025-05-07T20:33:20.1758476Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.1758782Z x = x_sign * x_clamp 2025-05-07T20:33:20.1759027Z x0 = x[:, :D] 2025-05-07T20:33:20.1759247Z x1 = x[:, D:] 2025-05-07T20:33:20.1759453Z 2025-05-07T20:33:20.1759652Z if contiguous: 2025-05-07T20:33:20.1759891Z x0 = x0.contiguous() 2025-05-07T20:33:20.1760148Z x1 = x1.contiguous() 2025-05-07T20:33:20.1760393Z 2025-05-07T20:33:20.1760593Z if scale_ub is not None: 2025-05-07T20:33:20.1760874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.1761206Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.1761526Z ) 2025-05-07T20:33:20.1761731Z else: 2025-05-07T20:33:20.1761947Z scale_ub_tensor = None 2025-05-07T20:33:20.1762273Z 2025-05-07T20:33:20.1762518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.1762836Z op = silu_mul_quant 2025-05-07T20:33:20.1763093Z if compiled: 2025-05-07T20:33:20.1763348Z op = torch.compile(op) 2025-05-07T20:33:20.1763637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1763916Z 2025-05-07T20:33:20.1764112Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.1764275Z 2025-05-07T20:33:20.1764375Z moe/activation_test.py:117: 2025-05-07T20:33:20.1764673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1765004Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.1765280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1766252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.1766827Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.1767484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.1768184Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.1768725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.1769416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.1770084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.1770618Z kernel = self.compile( 2025-05-07T20:33:20.1771175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.1771915Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.1772320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1772611Z 2025-05-07T20:33:20.1772848Z self = 2025-05-07T20:33:20.1774040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.1775463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf558720>} 2025-05-07T20:33:20.1776806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.1777824Z context = 2025-05-07T20:33:20.1778273Z 2025-05-07T20:33:20.1778443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.1778963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.1779488Z module_map=module_map) 2025-05-07T20:33:20.1779845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.1780198Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.1780458Z E ^ 2025-05-07T20:33:20.1780913Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.1781366Z 2025-05-07T20:33:20.1781778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.1782290Z 2025-05-07T20:33:20.1782392Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.1782809Z self=, 2025-05-07T20:33:20.1783204Z T=4096, 2025-05-07T20:33:20.1783453Z D=7168, 2025-05-07T20:33:20.1783654Z scale_ub=None, 2025-05-07T20:33:20.1783869Z contiguous=False, 2025-05-07T20:33:20.1784096Z compiled=True, 2025-05-07T20:33:20.1784304Z ) 2025-05-07T20:33:20.1784625Z self = 2025-05-07T20:33:20.1785121Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.1785394Z 2025-05-07T20:33:20.1785479Z @given( 2025-05-07T20:33:20.1785706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.1786025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.1786331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.1786663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.1786997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.1787280Z ) 2025-05-07T20:33:20.1787626Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.1788072Z def test_silu_mul_quant( 2025-05-07T20:33:20.1788316Z self, 2025-05-07T20:33:20.1788510Z T: int, 2025-05-07T20:33:20.1788709Z D: int, 2025-05-07T20:33:20.1788931Z scale_ub: Optional[float], 2025-05-07T20:33:20.1789196Z contiguous: bool, 2025-05-07T20:33:20.1789436Z compiled: bool, 2025-05-07T20:33:20.1789663Z ) -> None: 2025-05-07T20:33:20.1789874Z torch.manual_seed(2025) 2025-05-07T20:33:20.1790114Z 2025-05-07T20:33:20.1790387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.1790723Z 2025-05-07T20:33:20.1790915Z x_sign = torch.sign(x) 2025-05-07T20:33:20.1791200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.1791504Z x = x_sign * x_clamp 2025-05-07T20:33:20.1791758Z x0 = x[:, :D] 2025-05-07T20:33:20.1791972Z x1 = x[:, D:] 2025-05-07T20:33:20.1792183Z 2025-05-07T20:33:20.1792362Z if contiguous: 2025-05-07T20:33:20.1792597Z x0 = x0.contiguous() 2025-05-07T20:33:20.1792858Z x1 = x1.contiguous() 2025-05-07T20:33:20.1793097Z 2025-05-07T20:33:20.1793286Z if scale_ub is not None: 2025-05-07T20:33:20.1793556Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.1793883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.1794189Z ) 2025-05-07T20:33:20.1794385Z else: 2025-05-07T20:33:20.1794588Z scale_ub_tensor = None 2025-05-07T20:33:20.1794845Z 2025-05-07T20:33:20.1795076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.1795393Z op = silu_mul_quant 2025-05-07T20:33:20.1795640Z if compiled: 2025-05-07T20:33:20.1795943Z op = torch.compile(op) 2025-05-07T20:33:20.1796244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1796572Z 2025-05-07T20:33:20.1796803Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.1796969Z 2025-05-07T20:33:20.1797074Z moe/activation_test.py:117: 2025-05-07T20:33:20.1797361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1797730Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.1798014Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.1798559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.1799119Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.1799774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.1800463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.1800995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.1801679Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.1802382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.1802920Z kernel = self.compile( 2025-05-07T20:33:20.1803453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.1804111Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.1804509Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.1804736Z 2025-05-07T20:33:20.1804944Z self = 2025-05-07T20:33:20.1806017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.1807395Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf559440>} 2025-05-07T20:33:20.1808789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.1809815Z context = 2025-05-07T20:33:20.1810104Z 2025-05-07T20:33:20.1810267Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.1810785Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.1811254Z module_map=module_map) 2025-05-07T20:33:20.1811626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.1811980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.1812246Z E ^ 2025-05-07T20:33:20.1812708Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.1813157Z 2025-05-07T20:33:20.1813566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.1814076Z 2025-05-07T20:33:20.3408741Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.3409372Z self=, 2025-05-07T20:33:20.3409949Z T=16384, 2025-05-07T20:33:20.3410263Z D=5120, 2025-05-07T20:33:20.3410540Z scale_ub=1200.0, 2025-05-07T20:33:20.3410849Z contiguous=False, 2025-05-07T20:33:20.3411105Z compiled=False, 2025-05-07T20:33:20.3411319Z ) 2025-05-07T20:33:20.3411649Z self = 2025-05-07T20:33:20.3412302Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:20.3412658Z 2025-05-07T20:33:20.3412739Z @given( 2025-05-07T20:33:20.3412980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.3413302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.3413673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.3414009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.3414342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.3414633Z ) 2025-05-07T20:33:20.3414986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.3415433Z def test_silu_mul_quant( 2025-05-07T20:33:20.3415673Z self, 2025-05-07T20:33:20.3415877Z T: int, 2025-05-07T20:33:20.3416080Z D: int, 2025-05-07T20:33:20.3416295Z scale_ub: Optional[float], 2025-05-07T20:33:20.3416572Z contiguous: bool, 2025-05-07T20:33:20.3416815Z compiled: bool, 2025-05-07T20:33:20.3417054Z ) -> None: 2025-05-07T20:33:20.3417269Z torch.manual_seed(2025) 2025-05-07T20:33:20.3417581Z 2025-05-07T20:33:20.3417859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.3418206Z 2025-05-07T20:33:20.3418420Z x_sign = torch.sign(x) 2025-05-07T20:33:20.3418753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.3419062Z x = x_sign * x_clamp 2025-05-07T20:33:20.3419309Z x0 = x[:, :D] 2025-05-07T20:33:20.3419538Z x1 = x[:, D:] 2025-05-07T20:33:20.3419743Z 2025-05-07T20:33:20.3419932Z if contiguous: 2025-05-07T20:33:20.3420169Z x0 = x0.contiguous() 2025-05-07T20:33:20.3420429Z x1 = x1.contiguous() 2025-05-07T20:33:20.3420669Z 2025-05-07T20:33:20.3420868Z if scale_ub is not None: 2025-05-07T20:33:20.3421137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.3421475Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.3421797Z ) 2025-05-07T20:33:20.3421993Z else: 2025-05-07T20:33:20.3422207Z scale_ub_tensor = None 2025-05-07T20:33:20.3422467Z 2025-05-07T20:33:20.3422702Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.3423019Z op = silu_mul_quant 2025-05-07T20:33:20.3423275Z if compiled: 2025-05-07T20:33:20.3423524Z op = torch.compile(op) 2025-05-07T20:33:20.3423820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3424106Z 2025-05-07T20:33:20.3424301Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.3424464Z 2025-05-07T20:33:20.3424566Z moe/activation_test.py:117: 2025-05-07T20:33:20.3424865Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3425207Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.3425602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3426305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:20.3427005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.3427548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.3428230Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.3428943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.3429481Z kernel = self.compile( 2025-05-07T20:33:20.3430026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.3430682Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.3431083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3431315Z 2025-05-07T20:33:20.3431588Z self = 2025-05-07T20:33:20.3432951Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.3434369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55a340>} 2025-05-07T20:33:20.3435786Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.3436843Z context = 2025-05-07T20:33:20.3437133Z 2025-05-07T20:33:20.3437305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.3437877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.3438352Z module_map=module_map) 2025-05-07T20:33:20.3438720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.3439077Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.3439333Z E ^ 2025-05-07T20:33:20.3439802Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.3440255Z 2025-05-07T20:33:20.3440675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.3441185Z 2025-05-07T20:33:20.3441288Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.3441701Z self=, 2025-05-07T20:33:20.3442104Z T=16384, 2025-05-07T20:33:20.3442301Z D=5120, 2025-05-07T20:33:20.3442497Z scale_ub=1200.0, 2025-05-07T20:33:20.3442719Z contiguous=True, 2025-05-07T20:33:20.3442944Z compiled=True, 2025-05-07T20:33:20.3443141Z ) 2025-05-07T20:33:20.3443458Z self = 2025-05-07T20:33:20.3443952Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:20.3444227Z 2025-05-07T20:33:20.3444306Z @given( 2025-05-07T20:33:20.3444537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.3444853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.3445157Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.3445488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.3445820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.3446110Z ) 2025-05-07T20:33:20.3446455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.3446898Z def test_silu_mul_quant( 2025-05-07T20:33:20.3447140Z self, 2025-05-07T20:33:20.3447334Z T: int, 2025-05-07T20:33:20.3447534Z D: int, 2025-05-07T20:33:20.3447752Z scale_ub: Optional[float], 2025-05-07T20:33:20.3448024Z contiguous: bool, 2025-05-07T20:33:20.3448265Z compiled: bool, 2025-05-07T20:33:20.3448489Z ) -> None: 2025-05-07T20:33:20.3448704Z torch.manual_seed(2025) 2025-05-07T20:33:20.3448950Z 2025-05-07T20:33:20.3449229Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.3449571Z 2025-05-07T20:33:20.3449771Z x_sign = torch.sign(x) 2025-05-07T20:33:20.3450063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.3450378Z x = x_sign * x_clamp 2025-05-07T20:33:20.3450615Z x0 = x[:, :D] 2025-05-07T20:33:20.3450831Z x1 = x[:, D:] 2025-05-07T20:33:20.3451042Z 2025-05-07T20:33:20.3451223Z if contiguous: 2025-05-07T20:33:20.3451511Z x0 = x0.contiguous() 2025-05-07T20:33:20.3451806Z x1 = x1.contiguous() 2025-05-07T20:33:20.3452050Z 2025-05-07T20:33:20.3452244Z if scale_ub is not None: 2025-05-07T20:33:20.3452518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.3452889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.3453198Z ) 2025-05-07T20:33:20.3453395Z else: 2025-05-07T20:33:20.3453608Z scale_ub_tensor = None 2025-05-07T20:33:20.3453863Z 2025-05-07T20:33:20.3454096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.3454408Z op = silu_mul_quant 2025-05-07T20:33:20.3454659Z if compiled: 2025-05-07T20:33:20.3454911Z op = torch.compile(op) 2025-05-07T20:33:20.3455205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3455491Z 2025-05-07T20:33:20.3455695Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.3455857Z 2025-05-07T20:33:20.3455967Z moe/activation_test.py:117: 2025-05-07T20:33:20.3456301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3456639Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.3456934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.3457490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.3458053Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.3458713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.3459400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.3459936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.3460616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.3461286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.3461820Z kernel = self.compile( 2025-05-07T20:33:20.3462359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.3463020Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.3463419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.3463654Z 2025-05-07T20:33:20.3463862Z self = 2025-05-07T20:33:20.3464949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.3466663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55b9c0>} 2025-05-07T20:33:20.3468025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.3469061Z context = 2025-05-07T20:33:20.3469349Z 2025-05-07T20:33:20.3469518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.3470046Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.3470518Z module_map=module_map) 2025-05-07T20:33:20.3470884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.3471242Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.3471504Z E ^ 2025-05-07T20:33:20.3472059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.3472605Z 2025-05-07T20:33:20.3473111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.3473845Z 2025-05-07T20:33:20.5181147Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.5181822Z self=, 2025-05-07T20:33:20.5182395Z T=16384, 2025-05-07T20:33:20.5182660Z D=5120, 2025-05-07T20:33:20.5182929Z scale_ub=None, 2025-05-07T20:33:20.5183152Z contiguous=False, 2025-05-07T20:33:20.5183382Z compiled=True, 2025-05-07T20:33:20.5183595Z ) 2025-05-07T20:33:20.5183920Z self = 2025-05-07T20:33:20.5184425Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.5184751Z 2025-05-07T20:33:20.5184831Z @given( 2025-05-07T20:33:20.5185090Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.5185565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.5185874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.5186214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.5186548Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.5186836Z ) 2025-05-07T20:33:20.5187187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.5187632Z def test_silu_mul_quant( 2025-05-07T20:33:20.5187874Z self, 2025-05-07T20:33:20.5188078Z T: int, 2025-05-07T20:33:20.5188278Z D: int, 2025-05-07T20:33:20.5188507Z scale_ub: Optional[float], 2025-05-07T20:33:20.5188829Z contiguous: bool, 2025-05-07T20:33:20.5189069Z compiled: bool, 2025-05-07T20:33:20.5189297Z ) -> None: 2025-05-07T20:33:20.5189512Z torch.manual_seed(2025) 2025-05-07T20:33:20.5189764Z 2025-05-07T20:33:20.5190052Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.5190397Z 2025-05-07T20:33:20.5190602Z x_sign = torch.sign(x) 2025-05-07T20:33:20.5190902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.5197432Z x = x_sign * x_clamp 2025-05-07T20:33:20.5197692Z x0 = x[:, :D] 2025-05-07T20:33:20.5197915Z x1 = x[:, D:] 2025-05-07T20:33:20.5198129Z 2025-05-07T20:33:20.5198325Z if contiguous: 2025-05-07T20:33:20.5198554Z x0 = x0.contiguous() 2025-05-07T20:33:20.5198817Z x1 = x1.contiguous() 2025-05-07T20:33:20.5199062Z 2025-05-07T20:33:20.5199254Z if scale_ub is not None: 2025-05-07T20:33:20.5199529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.5199875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.5200186Z ) 2025-05-07T20:33:20.5200379Z else: 2025-05-07T20:33:20.5200596Z scale_ub_tensor = None 2025-05-07T20:33:20.5200850Z 2025-05-07T20:33:20.5201097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.5201417Z op = silu_mul_quant 2025-05-07T20:33:20.5201681Z if compiled: 2025-05-07T20:33:20.5201925Z op = torch.compile(op) 2025-05-07T20:33:20.5202230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.5202508Z 2025-05-07T20:33:20.5202697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.5202870Z 2025-05-07T20:33:20.5202971Z moe/activation_test.py:117: 2025-05-07T20:33:20.5203286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.5203624Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.5203903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.5204467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.5205027Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.5205851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.5206542Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.5207141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.5207821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.5208494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.5209066Z kernel = self.compile( 2025-05-07T20:33:20.5209605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.5210251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.5210651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.5210890Z 2025-05-07T20:33:20.5211137Z self = 2025-05-07T20:33:20.5212221Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.5213603Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf368c20>} 2025-05-07T20:33:20.5214936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.5215967Z context = 2025-05-07T20:33:20.5216263Z 2025-05-07T20:33:20.5216433Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.5216964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.5217435Z module_map=module_map) 2025-05-07T20:33:20.5217804Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.5218158Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.5218419Z E ^ 2025-05-07T20:33:20.5218935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.5219389Z 2025-05-07T20:33:20.5219801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.5220309Z 2025-05-07T20:33:20.5220421Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.5220837Z self=, 2025-05-07T20:33:20.5221251Z T=2048, 2025-05-07T20:33:20.5221448Z D=5120, 2025-05-07T20:33:20.5221644Z scale_ub=None, 2025-05-07T20:33:20.5221865Z contiguous=False, 2025-05-07T20:33:20.5222092Z compiled=True, 2025-05-07T20:33:20.5222292Z ) 2025-05-07T20:33:20.6126825Z self = 2025-05-07T20:33:20.6127604Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:20.6128012Z 2025-05-07T20:33:20.6128132Z @given( 2025-05-07T20:33:20.6128461Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.6129341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.6129975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.6130642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.6131300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.6131881Z ) 2025-05-07T20:33:20.6132820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.6133817Z def test_silu_mul_quant( 2025-05-07T20:33:20.6134311Z self, 2025-05-07T20:33:20.6134708Z T: int, 2025-05-07T20:33:20.6135103Z D: int, 2025-05-07T20:33:20.6135661Z scale_ub: Optional[float], 2025-05-07T20:33:20.6136207Z contiguous: bool, 2025-05-07T20:33:20.6136685Z compiled: bool, 2025-05-07T20:33:20.6137143Z ) -> None: 2025-05-07T20:33:20.6137585Z torch.manual_seed(2025) 2025-05-07T20:33:20.6138076Z 2025-05-07T20:33:20.6138557Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.6138961Z 2025-05-07T20:33:20.6139158Z x_sign = torch.sign(x) 2025-05-07T20:33:20.6139456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.6139778Z x = x_sign * x_clamp 2025-05-07T20:33:20.6140024Z x0 = x[:, :D] 2025-05-07T20:33:20.6140241Z x1 = x[:, D:] 2025-05-07T20:33:20.6140454Z 2025-05-07T20:33:20.6140662Z if contiguous: 2025-05-07T20:33:20.6140905Z x0 = x0.contiguous() 2025-05-07T20:33:20.6141234Z x1 = x1.contiguous() 2025-05-07T20:33:20.6141483Z 2025-05-07T20:33:20.6141685Z if scale_ub is not None: 2025-05-07T20:33:20.6141974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.6142329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.6142654Z ) 2025-05-07T20:33:20.6142857Z else: 2025-05-07T20:33:20.6143082Z scale_ub_tensor = None 2025-05-07T20:33:20.6143336Z 2025-05-07T20:33:20.6143572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.6143896Z op = silu_mul_quant 2025-05-07T20:33:20.6144152Z if compiled: 2025-05-07T20:33:20.6144408Z op = torch.compile(op) 2025-05-07T20:33:20.6144706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6144991Z 2025-05-07T20:33:20.6145197Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.6145370Z 2025-05-07T20:33:20.6145481Z moe/activation_test.py:117: 2025-05-07T20:33:20.6145788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6146134Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.6146422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6146986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.6147557Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.6148225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.6148918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.6149454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.6150139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.6150816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.6151349Z kernel = self.compile( 2025-05-07T20:33:20.6151898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.6152560Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.6152964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6153197Z 2025-05-07T20:33:20.6153408Z self = 2025-05-07T20:33:20.6154498Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.6156005Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf3699e0>} 2025-05-07T20:33:20.6157393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.6158467Z context = 2025-05-07T20:33:20.6158760Z 2025-05-07T20:33:20.6158931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.6159464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.6159969Z module_map=module_map) 2025-05-07T20:33:20.6160345Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.6160714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.6160978Z E ^ 2025-05-07T20:33:20.6161504Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.6161960Z 2025-05-07T20:33:20.6162383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.6162899Z 2025-05-07T20:33:20.6163010Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.6163436Z self=, 2025-05-07T20:33:20.6163854Z T=2048, 2025-05-07T20:33:20.6164053Z D=5120, 2025-05-07T20:33:20.6164256Z scale_ub=1200.0, 2025-05-07T20:33:20.6164488Z contiguous=False, 2025-05-07T20:33:20.6164718Z compiled=True, 2025-05-07T20:33:20.6164940Z ) 2025-05-07T20:33:20.6165269Z self = 2025-05-07T20:33:20.6166025Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:20.6166308Z 2025-05-07T20:33:20.6166399Z @given( 2025-05-07T20:33:20.6166647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.6166970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.6167284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.6167620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.6167958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.6168245Z ) 2025-05-07T20:33:20.6168628Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.6169113Z def test_silu_mul_quant( 2025-05-07T20:33:20.6169358Z self, 2025-05-07T20:33:20.6169565Z T: int, 2025-05-07T20:33:20.6169768Z D: int, 2025-05-07T20:33:20.6169990Z scale_ub: Optional[float], 2025-05-07T20:33:20.6170265Z contiguous: bool, 2025-05-07T20:33:20.6170518Z compiled: bool, 2025-05-07T20:33:20.6170751Z ) -> None: 2025-05-07T20:33:20.6170971Z torch.manual_seed(2025) 2025-05-07T20:33:20.6171227Z 2025-05-07T20:33:20.6171515Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.6171859Z 2025-05-07T20:33:20.6172059Z x_sign = torch.sign(x) 2025-05-07T20:33:20.6172358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.6172675Z x = x_sign * x_clamp 2025-05-07T20:33:20.6172924Z x0 = x[:, :D] 2025-05-07T20:33:20.6173151Z x1 = x[:, D:] 2025-05-07T20:33:20.6173362Z 2025-05-07T20:33:20.6173560Z if contiguous: 2025-05-07T20:33:20.6173797Z x0 = x0.contiguous() 2025-05-07T20:33:20.6174059Z x1 = x1.contiguous() 2025-05-07T20:33:20.6174308Z 2025-05-07T20:33:20.6174507Z if scale_ub is not None: 2025-05-07T20:33:20.6174781Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.6175127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.6175441Z ) 2025-05-07T20:33:20.6175797Z else: 2025-05-07T20:33:20.6176015Z scale_ub_tensor = None 2025-05-07T20:33:20.6176280Z 2025-05-07T20:33:20.6176511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.6176903Z op = silu_mul_quant 2025-05-07T20:33:20.6177157Z if compiled: 2025-05-07T20:33:20.6177408Z op = torch.compile(op) 2025-05-07T20:33:20.6177707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6177989Z 2025-05-07T20:33:20.6178194Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.6178359Z 2025-05-07T20:33:20.6178458Z moe/activation_test.py:117: 2025-05-07T20:33:20.6178760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6179103Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.6179385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.6179950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.6180523Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.6181244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.6181936Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.6182481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.6183168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.6183834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.6184366Z kernel = self.compile( 2025-05-07T20:33:20.6184913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.6185568Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.6185971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.6186213Z 2025-05-07T20:33:20.6186427Z self = 2025-05-07T20:33:20.6187516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.6188952Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf36ab60>} 2025-05-07T20:33:20.6190293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.6191327Z context = 2025-05-07T20:33:20.6191630Z 2025-05-07T20:33:20.6191800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.6192330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.6192800Z module_map=module_map) 2025-05-07T20:33:20.6193171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.6193539Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.6193810Z E ^ 2025-05-07T20:33:20.6194271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.6194725Z 2025-05-07T20:33:20.6195140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.6195651Z 2025-05-07T20:33:20.7948358Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.7949287Z self=, 2025-05-07T20:33:20.7949890Z T=4096, 2025-05-07T20:33:20.7950085Z D=5120, 2025-05-07T20:33:20.7950285Z scale_ub=1200.0, 2025-05-07T20:33:20.7950510Z contiguous=True, 2025-05-07T20:33:20.7950808Z compiled=True, 2025-05-07T20:33:20.7951013Z ) 2025-05-07T20:33:20.7951334Z self = 2025-05-07T20:33:20.7951831Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:20.7952119Z 2025-05-07T20:33:20.7952198Z @given( 2025-05-07T20:33:20.7952430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.7952742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.7953042Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.7953374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.7953702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.7953989Z ) 2025-05-07T20:33:20.7954342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.7954842Z def test_silu_mul_quant( 2025-05-07T20:33:20.7955083Z self, 2025-05-07T20:33:20.7955284Z T: int, 2025-05-07T20:33:20.7955479Z D: int, 2025-05-07T20:33:20.7955691Z scale_ub: Optional[float], 2025-05-07T20:33:20.7956029Z contiguous: bool, 2025-05-07T20:33:20.7956272Z compiled: bool, 2025-05-07T20:33:20.7956499Z ) -> None: 2025-05-07T20:33:20.7956711Z torch.manual_seed(2025) 2025-05-07T20:33:20.7956960Z 2025-05-07T20:33:20.7957229Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.7957572Z 2025-05-07T20:33:20.7957768Z x_sign = torch.sign(x) 2025-05-07T20:33:20.7958057Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.7958366Z x = x_sign * x_clamp 2025-05-07T20:33:20.7958611Z x0 = x[:, :D] 2025-05-07T20:33:20.7958859Z x1 = x[:, D:] 2025-05-07T20:33:20.7959084Z 2025-05-07T20:33:20.7959274Z if contiguous: 2025-05-07T20:33:20.7959510Z x0 = x0.contiguous() 2025-05-07T20:33:20.7959761Z x1 = x1.contiguous() 2025-05-07T20:33:20.7960004Z 2025-05-07T20:33:20.7960201Z if scale_ub is not None: 2025-05-07T20:33:20.7960469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.7960804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.7961111Z ) 2025-05-07T20:33:20.7961302Z else: 2025-05-07T20:33:20.7961515Z scale_ub_tensor = None 2025-05-07T20:33:20.7961766Z 2025-05-07T20:33:20.7961999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.7962309Z op = silu_mul_quant 2025-05-07T20:33:20.7962557Z if compiled: 2025-05-07T20:33:20.7962803Z op = torch.compile(op) 2025-05-07T20:33:20.7963091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.7963370Z 2025-05-07T20:33:20.7963564Z > y_fp8, y_scale = fn() 2025-05-07T20:33:20.7963728Z 2025-05-07T20:33:20.7963828Z moe/activation_test.py:117: 2025-05-07T20:33:20.7964126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.7964457Z moe/activation_test.py:115: in fn 2025-05-07T20:33:20.7964731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.7965292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:20.7966111Z return fn(*args, **kwargs) 
2025-05-07T20:33:20.7966767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:20.7967446Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:20.7967978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:20.7968744Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.7969499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.7970080Z kernel = self.compile( 2025-05-07T20:33:20.7970618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.7971265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.7971660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.7971893Z 2025-05-07T20:33:20.7972101Z self = 2025-05-07T20:33:20.7973191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.7974626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf4b8180>} 2025-05-07T20:33:20.7975975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.7976992Z context = 2025-05-07T20:33:20.7977282Z 2025-05-07T20:33:20.7977449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.7977976Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.7978448Z module_map=module_map) 2025-05-07T20:33:20.7978842Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.7979213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:20.7979475Z E ^ 2025-05-07T20:33:20.7980087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.7980547Z 2025-05-07T20:33:20.7980958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.7981469Z 2025-05-07T20:33:20.7981573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.7981990Z self=, 2025-05-07T20:33:20.7982390Z T=128, 2025-05-07T20:33:20.7982577Z D=5120, 2025-05-07T20:33:20.7982769Z scale_ub=1200.0, 2025-05-07T20:33:20.7982987Z contiguous=False, 2025-05-07T20:33:20.7983209Z compiled=True, 2025-05-07T20:33:20.7983418Z ) 2025-05-07T20:33:21.0645583Z self = 2025-05-07T20:33:21.0646372Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:21.0646789Z 2025-05-07T20:33:21.0646916Z @given( 2025-05-07T20:33:21.0647234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.0647672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.0648010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.0648349Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.0648676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.0648973Z ) 2025-05-07T20:33:21.0649331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.0649779Z def test_silu_mul_quant( 2025-05-07T20:33:21.0650032Z self, 2025-05-07T20:33:21.0650232Z T: int, 2025-05-07T20:33:21.0650439Z D: int, 2025-05-07T20:33:21.0650662Z scale_ub: Optional[float], 2025-05-07T20:33:21.0650943Z contiguous: bool, 2025-05-07T20:33:21.0651186Z compiled: bool, 2025-05-07T20:33:21.0651595Z ) -> None: 2025-05-07T20:33:21.0651819Z torch.manual_seed(2025) 2025-05-07T20:33:21.0652070Z 2025-05-07T20:33:21.0652344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.0652757Z 2025-05-07T20:33:21.0652961Z x_sign = torch.sign(x) 2025-05-07T20:33:21.0653256Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.0653578Z x = x_sign * x_clamp 2025-05-07T20:33:21.0653827Z x0 = x[:, :D] 2025-05-07T20:33:21.0654045Z x1 = x[:, D:] 2025-05-07T20:33:21.0654259Z 2025-05-07T20:33:21.0654455Z if contiguous: 2025-05-07T20:33:21.0654688Z x0 = x0.contiguous() 2025-05-07T20:33:21.0654957Z x1 = x1.contiguous() 2025-05-07T20:33:21.0655209Z 2025-05-07T20:33:21.0655401Z if scale_ub is not None: 2025-05-07T20:33:21.0655681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.0656028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.0656349Z ) 2025-05-07T20:33:21.0656551Z else: 2025-05-07T20:33:21.0656838Z scale_ub_tensor = None 2025-05-07T20:33:21.0657094Z 2025-05-07T20:33:21.0657339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.0657667Z op = silu_mul_quant 2025-05-07T20:33:21.0657930Z if compiled: 2025-05-07T20:33:21.0658182Z op = torch.compile(op) 2025-05-07T20:33:21.0658494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.0658785Z 2025-05-07T20:33:21.0659022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.0659200Z 2025-05-07T20:33:21.0659306Z moe/activation_test.py:117: 2025-05-07T20:33:21.0659616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.0659964Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.0660250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.0666885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:21.0667499Z return fn(*args, **kwargs) 
2025-05-07T20:33:21.0668172Z [... tail of the previous Hypothesis example: the same silu_mul_quant traceback and fp8e4nv CompilationError that is shown in full below ...]
2025-05-07T20:33:21.0683006Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f32cf4b8ea0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
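Every CompilationError in this run bottoms out in the same ValueError: Triton's NVIDIA backend accepts the fp8e4nv (float8_e4m3fn) element type only on GPUs with compute capability 8.9 or newer (Ada/Hopper), and raises exactly this message on older architectures such as sm_86. A minimal guard sketch for fp8-dependent tests; the helper name, class name, and decorator placement are illustrative, not FBGEMM's actual code:

# Sketch: skip fp8 tests on GPUs where Triton rejects fp8e4nv.
# supports_fp8e4nv() is a hypothetical helper, not part of the FBGEMM suite.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device can compile fp8e4nv Triton kernels."""
    if not torch.cuda.is_available():
        return False
    # Triton's NVIDIA backend requires compute capability >= 8.9 for
    # fp8e4nv; an A10G, for example, reports (8, 6) and fails here.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
class Fp8SiluMulQuantTests(unittest.TestCase):
    pass  # fp8 test cases would go here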
2025-05-07T20:33:21.1941022Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... identical test body and CompilationError traceback as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
2025-05-07T20:33:21.1972962Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError ...]
2025-05-07T20:33:21.3743040Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]
2025-05-07T20:33:21.3778331Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError ...]
2025-05-07T20:33:21.4729829Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError ...]
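The examples above vary only the Hypothesis-drawn shapes and flags; the failure itself is shape-independent, so it reproduces without Hypothesis at the smallest size. A repro sketch, assuming the module path shown in the traceback and the op(x0, x1, scale_ub) call signature used by the test:

# Repro sketch for the shape-independent CompilationError; the import
# path is inferred from fbgemm_gpu/experimental/gen_ai/moe/activation.py
# in the traceback above.
import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)

# On a GPU older than compute capability 8.9 this raises
# triton.compiler.errors.CompilationError wrapping the fp8e4nv ValueError.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)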
2025-05-07T20:33:21.5444672Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same test body as above, now failing earlier, at the clamp: ...]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:33:21.5458510Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at the same clamp (moe/activation_test.py:95): tried to allocate 112.00 MiB with 28.44 MiB free ...]
2025-05-07T20:33:21.5472194Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... OutOfMemoryError already at the initial torch.randn (moe/activation_test.py:92): tried to allocate 448.00 MiB with 140.44 MiB free ...]
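The allocator's request sizes line up exactly with the test's tensor shapes: for the worst case above (T=16384, D=7168) the initial bfloat16 allocation alone is 448 MiB, matching "Tried to allocate 448.00 MiB", and each follow-up temporary (abs, clamp, sign, product) requests a block of the same size. A quick check of the arithmetic:

# Worked check of the 448.00 MiB request for T=16384, D=7168:
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes.
T, D = 16384, 7168
bytes_per_bf16 = 2
x_mib = T * (2 * D) * bytes_per_bf16 / 2**20
print(x_mib)  # 448.0, matching the allocator message above

The same formula gives 320 MiB for (16384, 5120), 112 MiB for (4096, 7168), and 56 MiB for (2048, 7168), matching every OOM report in this run.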
2025-05-07T20:33:21.5484669Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... OutOfMemoryError at the clamp (moe/activation_test.py:95): tried to allocate 56.00 MiB with 28.44 MiB free ...]
2025-05-07T20:33:21.5498014Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[... OutOfMemoryError one line earlier, at torch.sign (moe/activation_test.py:94): tried to allocate 56.00 MiB with 28.44 MiB free ...]
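These OOMs hit with roughly 22 GiB already resident, i.e. memory accumulated across the preceding Hypothesis examples rather than any single oversized tensor, and the error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. A mitigation sketch for the test side; release_cuda_memory() is a hypothetical per-example cleanup, not part of the original suite:

# Sketch: release cached CUDA blocks between Hypothesis examples so one
# example's temporaries cannot starve the next.
import gc

import torch


def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.synchronize()  # let in-flight kernels finish
    torch.cuda.empty_cache()  # return cached blocks to the driver

# The allocator hint from the error message is set in the environment
# before the test process starts, e.g.:
#   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True pytest moe/activation_test.py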
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.6672969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.6673662Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.6674431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.6674971Z kernel = self.compile( 2025-05-07T20:33:21.6675510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.6676227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.6676624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.6676857Z 2025-05-07T20:33:21.6677064Z self = 2025-05-07T20:33:21.6678153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.6679600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0aa520>} 2025-05-07T20:33:21.6680942Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.6681973Z context = 2025-05-07T20:33:21.6682263Z 2025-05-07T20:33:21.6682433Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.6682958Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.6683429Z module_map=module_map) 2025-05-07T20:33:21.6683796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.6684152Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.6684417Z E ^ 2025-05-07T20:33:21.6684880Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.6685329Z 2025-05-07T20:33:21.6685741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.6686259Z 2025-05-07T20:33:21.6686363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.6686780Z self=, 2025-05-07T20:33:21.6687189Z T=128, 2025-05-07T20:33:21.6687379Z D=5120, 2025-05-07T20:33:21.6687573Z scale_ub=None, 2025-05-07T20:33:21.6687795Z contiguous=True, 2025-05-07T20:33:21.6688014Z compiled=False, 2025-05-07T20:33:21.6688224Z ) 2025-05-07T20:33:21.7361286Z self = 2025-05-07T20:33:21.7362069Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:21.7362537Z 2025-05-07T20:33:21.7362647Z @given( 2025-05-07T20:33:21.7362961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.7363338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.7363731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.7364058Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.7364387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.7364672Z ) 2025-05-07T20:33:21.7365017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.7365627Z def test_silu_mul_quant( 2025-05-07T20:33:21.7365866Z self, 2025-05-07T20:33:21.7366055Z T: int, 2025-05-07T20:33:21.7366248Z D: int, 2025-05-07T20:33:21.7366465Z scale_ub: Optional[float], 2025-05-07T20:33:21.7366734Z contiguous: bool, 2025-05-07T20:33:21.7366980Z compiled: bool, 2025-05-07T20:33:21.7367212Z ) -> None: 2025-05-07T20:33:21.7367428Z torch.manual_seed(2025) 2025-05-07T20:33:21.7367741Z 2025-05-07T20:33:21.7368014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.7368367Z 2025-05-07T20:33:21.7368555Z x_sign = torch.sign(x) 2025-05-07T20:33:21.7368845Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.7369156Z x = x_sign * x_clamp 2025-05-07T20:33:21.7369426Z x0 = x[:, :D] 2025-05-07T20:33:21.7369661Z x1 = x[:, D:] 2025-05-07T20:33:21.7369874Z 2025-05-07T20:33:21.7370056Z if contiguous: 2025-05-07T20:33:21.7370293Z x0 = x0.contiguous() 2025-05-07T20:33:21.7370556Z x1 = x1.contiguous() 2025-05-07T20:33:21.7370790Z 2025-05-07T20:33:21.7370987Z if scale_ub is not None: 2025-05-07T20:33:21.7371257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.7371587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.7371902Z ) 2025-05-07T20:33:21.7372097Z else: 2025-05-07T20:33:21.7372314Z scale_ub_tensor = None 2025-05-07T20:33:21.7372565Z 2025-05-07T20:33:21.7372799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.7373111Z op = silu_mul_quant 2025-05-07T20:33:21.7373363Z if compiled: 2025-05-07T20:33:21.7373607Z op = torch.compile(op) 2025-05-07T20:33:21.7373900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7374168Z 2025-05-07T20:33:21.7374364Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.7374526Z 2025-05-07T20:33:21.7374631Z moe/activation_test.py:117: 2025-05-07T20:33:21.7374926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7375265Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.7375542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7376225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:21.7376919Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.7377456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.7378136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.7378793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.7379326Z kernel = self.compile( 2025-05-07T20:33:21.7379862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.7380518Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.7380908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7381142Z 2025-05-07T20:33:21.7381422Z self = 2025-05-07T20:33:21.7382560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.7383986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0ab420>} 2025-05-07T20:33:21.7385322Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.7386351Z context = 2025-05-07T20:33:21.7386645Z 2025-05-07T20:33:21.7386809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.7387374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.7387840Z module_map=module_map) 2025-05-07T20:33:21.7388210Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.7388572Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.7388827Z E ^ 2025-05-07T20:33:21.7389300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.7389749Z 2025-05-07T20:33:21.7390158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.7390664Z 2025-05-07T20:33:21.7390773Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.7391178Z self=, 2025-05-07T20:33:21.7391577Z T=128, 2025-05-07T20:33:21.7391772Z D=7168, 2025-05-07T20:33:21.7391969Z scale_ub=None, 2025-05-07T20:33:21.7392179Z contiguous=True, 2025-05-07T20:33:21.7392410Z compiled=False, 2025-05-07T20:33:21.7392636Z ) 2025-05-07T20:33:21.7392953Z self = 2025-05-07T20:33:21.7393444Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:21.7393720Z 2025-05-07T20:33:21.7393795Z @given( 2025-05-07T20:33:21.7394033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.7394339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.7394644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.7394968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.7395295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.7395580Z ) 2025-05-07T20:33:21.7395990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.7396434Z def test_silu_mul_quant( 2025-05-07T20:33:21.7396681Z self, 2025-05-07T20:33:21.7396872Z T: int, 2025-05-07T20:33:21.7397074Z D: int, 2025-05-07T20:33:21.7397289Z scale_ub: Optional[float], 2025-05-07T20:33:21.7397565Z contiguous: bool, 2025-05-07T20:33:21.7397800Z compiled: bool, 2025-05-07T20:33:21.7398018Z ) -> None: 2025-05-07T20:33:21.7398229Z torch.manual_seed(2025) 2025-05-07T20:33:21.7398464Z 2025-05-07T20:33:21.7398733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.7399071Z 2025-05-07T20:33:21.7399280Z x_sign = torch.sign(x) 2025-05-07T20:33:21.7399602Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:21.7399912Z x = x_sign * x_clamp 2025-05-07T20:33:21.7400146Z x0 = x[:, :D] 2025-05-07T20:33:21.7400364Z x1 = x[:, D:] 2025-05-07T20:33:21.7400570Z 2025-05-07T20:33:21.7400748Z if contiguous: 2025-05-07T20:33:21.7401032Z x0 = x0.contiguous() 2025-05-07T20:33:21.7401329Z x1 = x1.contiguous() 2025-05-07T20:33:21.7401566Z 2025-05-07T20:33:21.7401754Z if scale_ub is not None: 2025-05-07T20:33:21.7402022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:21.7402390Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:21.7402699Z ) 2025-05-07T20:33:21.7402890Z else: 2025-05-07T20:33:21.7403101Z scale_ub_tensor = None 2025-05-07T20:33:21.7403346Z 2025-05-07T20:33:21.7403582Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:21.7403892Z op = silu_mul_quant 2025-05-07T20:33:21.7404138Z if compiled: 2025-05-07T20:33:21.7404379Z op = torch.compile(op) 2025-05-07T20:33:21.7404671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7404935Z 2025-05-07T20:33:21.7405130Z > y_fp8, y_scale = fn() 2025-05-07T20:33:21.7405296Z 2025-05-07T20:33:21.7405408Z moe/activation_test.py:117: 2025-05-07T20:33:21.7405778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7406122Z moe/activation_test.py:115: in fn 2025-05-07T20:33:21.7406406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:21.7407096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:21.7407783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:21.7408324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:21.7409005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:21.7409666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:21.7410202Z kernel = self.compile( 2025-05-07T20:33:21.7410747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:21.7411406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:21.7411807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:21.7412049Z 2025-05-07T20:33:21.7412255Z self = 2025-05-07T20:33:21.7413334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:21.7414706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cef8c4a0>} 2025-05-07T20:33:21.7416050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:21.7417081Z context = 2025-05-07T20:33:21.7417375Z 2025-05-07T20:33:21.7417542Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:21.7418069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:21.7418532Z module_map=module_map) 2025-05-07T20:33:21.7418900Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:21.7419269Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:21.7419575Z E ^ 2025-05-07T20:33:21.7420041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:21.7420495Z 2025-05-07T20:33:21.7420956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:21.7421502Z 2025-05-07T20:33:21.7421616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:21.7422032Z self=, 2025-05-07T20:33:21.7422474Z T=2048, 2025-05-07T20:33:21.7422666Z D=7168, 2025-05-07T20:33:21.7422864Z scale_ub=1200.0, 2025-05-07T20:33:21.7423089Z contiguous=True, 2025-05-07T20:33:21.7423316Z compiled=False, 2025-05-07T20:33:21.7423521Z ) 2025-05-07T20:33:21.8233600Z self = 2025-05-07T20:33:21.8234349Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:21.8234735Z 2025-05-07T20:33:21.8234844Z @given( 2025-05-07T20:33:21.8235160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:21.8235507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:21.8235883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:21.8236235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:21.8236695Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:21.8236986Z ) 2025-05-07T20:33:21.8237347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:21.8237803Z def test_silu_mul_quant( 2025-05-07T20:33:21.8238047Z self, 2025-05-07T20:33:21.8238248Z T: int, 2025-05-07T20:33:21.8238452Z D: int, 2025-05-07T20:33:21.8238676Z scale_ub: Optional[float], 2025-05-07T20:33:21.8238953Z contiguous: bool, 2025-05-07T20:33:21.8239209Z compiled: bool, 2025-05-07T20:33:21.8239438Z ) -> None: 2025-05-07T20:33:21.8239657Z torch.manual_seed(2025) 2025-05-07T20:33:21.8239906Z 2025-05-07T20:33:21.8240183Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:21.8242252Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at moe/activation_test.py:117 (> y_fp8, y_scale = fn(), via silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:94 (> x_sign = torch.sign(x))

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (26.44 MiB free of 22.07 GiB) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError
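The out-of-memory failures above look cumulative rather than intrinsic to the individual examples: from T=2048 onward each example dies on its first CUDA allocation of a few tens or hundreds of MiB while roughly 22 GiB is already resident on the device, most plausibly tensors kept alive across examples (Hypothesis retains failing tracebacks while shrinking, and PyTorch's caching allocator holds on to freed blocks). A minimal cleanup hook along the following lines could be called at the top of each example; the release_cuda_memory helper is a sketch, not part of activation_test.py:

import gc

import torch

def release_cuda_memory() -> None:
    # Drop Python references to dead tensors, then return the CUDA
    # caching allocator's unused blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Calling this at the start of test_silu_mul_quant (before torch.manual_seed(2025)) would keep one falsifying attempt's allocations from starving the next; it cannot help if live references are genuinely held elsewhere.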
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.1565623Z
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.1566164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.1566838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.1567570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.1568103Z kernel = self.compile( 2025-05-07T20:33:22.1568636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.1569292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.1569734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.1569961Z 2025-05-07T20:33:22.1570170Z self = 2025-05-07T20:33:22.1571244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.1572626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32ced287c0>} 2025-05-07T20:33:22.1573970Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.1574996Z context = 2025-05-07T20:33:22.1575282Z 2025-05-07T20:33:22.1575449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.1575963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.1576430Z module_map=module_map) 2025-05-07T20:33:22.1576792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.1577138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.1577400Z E ^ 2025-05-07T20:33:22.1577866Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated) at moe/activation_test.py:92 (> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at moe/activation_test.py:117 (> y_fp8, y_scale = fn(), via torch/_dynamo/eval_frame.py:678: in _fn, then silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError
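The allocator hint repeated in each of these messages, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, is read when the CUDA caching allocator first initializes, so it has to be in the process environment before the first tensor lands on the GPU; setting it inside an already-running test is too late. A Python-side sketch of that (the workflow would more likely export it in the job environment; this is an assumption, not something this job currently sets):

import os

# Must happen before the first CUDA allocation in the process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

x = torch.zeros(1, device="cuda")  # allocator now uses expandable segments

Expandable segments reduce fragmentation-driven OOMs; they cannot recover the roughly 21.8 GiB that is still genuinely allocated in this run.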
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (4.44 MiB free of 22.07 GiB) at moe/activation_test.py:95 (> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError
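Every non-OOM failure in this run, including the fourth distinct failure in the summary below, is the same Triton error: the kernels request the fp8e4nv dtype (FP8 E4M3, torch.float8_e4m3fn), which Triton only compiles for NVIDIA GPUs of compute capability 8.9 or newer; the dtypes offered instead, 'fp8e4b15' and 'fp8e5', are the pre-sm_89 fallbacks. Judging from the 22.07 GiB capacity in the OOM messages, this runner's device is likely an A10G-class sm_86 part, hence the rejection. A guard along these lines (the helper name is hypothetical, not from the test file) would skip rather than fail on such hardware:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv lowers to native FP8 only on sm_89 (Ada) and newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
class Fp8KernelTests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # FP8 kernel launches only run where the dtype exists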
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.4537663Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:22.4538273Z | self=, 2025-05-07T20:33:22.4538832Z | T=2048, 2025-05-07T20:33:22.4539143Z | D=5120, # or any other generated value 2025-05-07T20:33:22.4539603Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:22.4540101Z | contiguous=True, # or any other generated value 2025-05-07T20:33:22.4540594Z | compiled=False, # or any other generated value 2025-05-07T20:33:22.4541021Z | ) 2025-05-07T20:33:22.4541271Z | 2025-05-07T20:33:22.4541987Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:22.4542803Z +---------------- 2 ---------------- 2025-05-07T20:33:22.4543203Z | Traceback (most recent call last): 2025-05-07T20:33:22.4544315Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:22.4545416Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4545933Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4548645Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.4551321Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:22.4551968Z | self=, 2025-05-07T20:33:22.4552538Z | T=128, 2025-05-07T20:33:22.4552861Z | D=7168, 2025-05-07T20:33:22.4553148Z | scale_ub=None, 2025-05-07T20:33:22.4553467Z | contiguous=True, 2025-05-07T20:33:22.4553799Z | compiled=True, 2025-05-07T20:33:22.4554116Z | ) 2025-05-07T20:33:22.4554358Z | 2025-05-07T20:33:22.4555069Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:22.4556010Z +---------------- 3 ---------------- 2025-05-07T20:33:22.4556423Z | Traceback (most recent call last): 2025-05-07T20:33:22.4557377Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:22.4558442Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4558959Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4561569Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.4563550Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:22.4563984Z | self=, 2025-05-07T20:33:22.4564398Z | T=128, 2025-05-07T20:33:22.4564603Z | D=5120, 2025-05-07T20:33:22.4564809Z | scale_ub=1200.0, 2025-05-07T20:33:22.4565054Z | contiguous=True, 2025-05-07T20:33:22.4565297Z | compiled=True, 2025-05-07T20:33:22.4565702Z | ) 2025-05-07T20:33:22.4565890Z | 2025-05-07T20:33:22.4566412Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:22.4567019Z +---------------- 4 ---------------- 2025-05-07T20:33:22.4567309Z | Traceback (most recent call last): 2025-05-07T20:33:22.4568018Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:22.4568734Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:22.4569021Z | ^^^^^^^^ 2025-05-07T20:33:22.4569762Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:22.4570507Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4570845Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4571694Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:22.4572486Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.4573095Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:22.4573828Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4574270Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4574910Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:22.4575739Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.4576215Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4576850Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:22.4577544Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.4577918Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:22.4578507Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:22.4579078Z | fn() 2025-05-07T20:33:22.4579672Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:22.4580330Z | self.fn.run( 2025-05-07T20:33:22.4580852Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:22.4581434Z | kernel = self.compile( 2025-05-07T20:33:22.4581698Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:22.4582282Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:22.4582984Z | 
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
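The first three sub-failures are plain allocator OOMs, not kernel bugs: each dies allocating the bfloat16 input after earlier tests have already pinned 21.77 GiB of the 22.07 GiB card, and the error text itself names the mitigation. A minimal sketch of applying it (the environment variable is standard PyTorch; the shape is the falsifying example from sub-failure 1):

import os

# Must be set before torch initializes CUDA, or the caching allocator
# will not pick it up, exactly as the OOM message above suggests.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# The allocation that OOMed in sub-failure 1 (T=2048, D=5120): 40 MiB of bf16.
x = torch.randn([2048, 2 * 5120], device="cuda", dtype=torch.bfloat16)

Expandable segments only helps when the failure is fragmentation of reserved-but-unallocated memory; here only 3.87 MiB is reserved and unallocated, so freeing memory leaked by earlier tests in the same process is the more likely fix.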
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f3351c60>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
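Every sub-failure above also prints a Hypothesis replay blob. A hedged sketch of using one, with the blob and strategies copied from this log; @reproduce_failure must be stacked on the same @given signature under the same Hypothesis version (6.131.14 here) or the blob will not decode, and the body below replays only the failing allocation rather than the full test:

from typing import Optional

import torch
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob printed for sub-failure 1
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
# _MAX_SAMPLES is private to the test module, so only deadline=None is kept here.
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    # The line that OOMed; the decorator pins T=2048, D=5120, etc.
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

The log says "temporarily" for a reason: the decorator should be removed once the example passes, or the test will only ever run that one input.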
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test source as in the first example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
(test source as above)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: → moe/activation_test.py:124: in ref_fn → fp8_gemm.py:2370: in triton_quantize_fp8_row → _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4736177Z 2025-05-07T20:33:22.4736719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4736726Z 2025-05-07T20:33:22.4736864Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4737148Z self=, 2025-05-07T20:33:22.4737261Z T=16384, 2025-05-07T20:33:22.4737363Z D=7168, 2025-05-07T20:33:22.4737473Z scale_ub=1200.0, 2025-05-07T20:33:22.4737594Z contiguous=False, 2025-05-07T20:33:22.4737708Z compiled=False, 2025-05-07T20:33:22.4737814Z ) 2025-05-07T20:33:22.4738116Z self = 2025-05-07T20:33:22.4738356Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.4738364Z 2025-05-07T20:33:22.4738475Z @given( 2025-05-07T20:33:22.4738624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4738748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4738904Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4739062Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4739216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4739324Z ) 2025-05-07T20:33:22.4739650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4739777Z def test_silu_mul_quant( 2025-05-07T20:33:22.4739881Z self, 2025-05-07T20:33:22.4739996Z T: int, 2025-05-07T20:33:22.4740215Z D: int, 2025-05-07T20:33:22.4740357Z scale_ub: Optional[float], 2025-05-07T20:33:22.4740489Z contiguous: bool, 2025-05-07T20:33:22.4740619Z compiled: bool, 2025-05-07T20:33:22.4740780Z ) -> None: 2025-05-07T20:33:22.4740912Z torch.manual_seed(2025) 2025-05-07T20:33:22.4741025Z 2025-05-07T20:33:22.4741255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4741365Z 2025-05-07T20:33:22.4741501Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4741671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4741796Z x = x_sign * x_clamp 2025-05-07T20:33:22.4741918Z x0 = x[:, :D] 2025-05-07T20:33:22.4742032Z x1 = x[:, D:] 2025-05-07T20:33:22.4742144Z 2025-05-07T20:33:22.4742261Z if contiguous: 2025-05-07T20:33:22.4742388Z x0 = x0.contiguous() 2025-05-07T20:33:22.4742524Z x1 = x1.contiguous() 2025-05-07T20:33:22.4742628Z 2025-05-07T20:33:22.4742765Z if scale_ub is not None: 2025-05-07T20:33:22.4742973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4743162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4743282Z ) 2025-05-07T20:33:22.4743396Z else: 2025-05-07T20:33:22.4743528Z scale_ub_tensor = None 2025-05-07T20:33:22.4743632Z 2025-05-07T20:33:22.4743815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4743940Z op = silu_mul_quant 2025-05-07T20:33:22.4744066Z if compiled: 2025-05-07T20:33:22.4744202Z op = torch.compile(op) 2025-05-07T20:33:22.4744348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4744459Z 2025-05-07T20:33:22.4744586Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4744592Z 2025-05-07T20:33:22.4744724Z moe/activation_test.py:117: 2025-05-07T20:33:22.4744916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4745062Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4745205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4745898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:22.4746037Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4746538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4746851Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4747317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4747455Z kernel = self.compile( 2025-05-07T20:33:22.4747969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4748151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4748291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4748298Z 2025-05-07T20:33:22.4748501Z self = 2025-05-07T20:33:22.4749290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4749843Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058040>} 2025-05-07T20:33:22.4750598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4750898Z context = 2025-05-07T20:33:22.4750906Z 2025-05-07T20:33:22.4751072Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4751387Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4751495Z module_map=module_map) 2025-05-07T20:33:22.4751665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4751765Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4751844Z E ^ 2025-05-07T20:33:22.4752208Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4752213Z 2025-05-07T20:33:22.4752629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4752633Z 2025-05-07T20:33:22.4752744Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4753014Z self=, 2025-05-07T20:33:22.4753095Z T=1, 2025-05-07T20:33:22.4753187Z D=7168, 2025-05-07T20:33:22.4753277Z scale_ub=None, 2025-05-07T20:33:22.4753364Z contiguous=True, 2025-05-07T20:33:22.4753455Z compiled=True, 2025-05-07T20:33:22.4753533Z ) 2025-05-07T20:33:22.4753753Z self = 2025-05-07T20:33:22.4753920Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.4753925Z 2025-05-07T20:33:22.4754006Z @given( 2025-05-07T20:33:22.4754127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4754238Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4754355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4754483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4754601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4754680Z ) 2025-05-07T20:33:22.4754936Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4755035Z def test_silu_mul_quant( 2025-05-07T20:33:22.4755121Z self, 2025-05-07T20:33:22.4755208Z T: int, 2025-05-07T20:33:22.4755288Z D: int, 2025-05-07T20:33:22.4755390Z scale_ub: Optional[float], 2025-05-07T20:33:22.4755488Z contiguous: bool, 2025-05-07T20:33:22.4755576Z compiled: bool, 2025-05-07T20:33:22.4755658Z ) -> None: 2025-05-07T20:33:22.4755877Z torch.manual_seed(2025) 2025-05-07T20:33:22.4755957Z 2025-05-07T20:33:22.4756136Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4756213Z 2025-05-07T20:33:22.4756306Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4756439Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4756530Z x = x_sign * x_clamp 2025-05-07T20:33:22.4756618Z x0 = x[:, :D] 2025-05-07T20:33:22.4756710Z x1 = x[:, D:] 2025-05-07T20:33:22.4756788Z 2025-05-07T20:33:22.4756874Z if contiguous: 2025-05-07T20:33:22.4756975Z x0 = x0.contiguous() 2025-05-07T20:33:22.4757067Z x1 = x1.contiguous() 2025-05-07T20:33:22.4757145Z 2025-05-07T20:33:22.4757243Z if scale_ub is not None: 2025-05-07T20:33:22.4757351Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4757501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4757580Z ) 2025-05-07T20:33:22.4757658Z else: 2025-05-07T20:33:22.4757761Z scale_ub_tensor = None 2025-05-07T20:33:22.4757838Z 2025-05-07T20:33:22.4757970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4758070Z op = silu_mul_quant 2025-05-07T20:33:22.4758158Z if compiled: 2025-05-07T20:33:22.4758259Z op = torch.compile(op) 2025-05-07T20:33:22.4758501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4758578Z 2025-05-07T20:33:22.4758675Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.4758802Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.4758920Z 2025-05-07T20:33:22.4759063Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4759168Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.4759269Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.4759398Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.4759560Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4759644Z 2025-05-07T20:33:22.4759775Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.4759781Z 2025-05-07T20:33:22.4759880Z moe/activation_test.py:126: 2025-05-07T20:33:22.4760011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4760133Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.4760314Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4760887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.4760993Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.4761356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4761584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4761951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.4762215Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.4762595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.4762772Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.4763126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.4763209Z fn() 2025-05-07T20:33:22.4763609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.4763699Z self.fn.run( 2025-05-07T20:33:22.4764036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4764135Z kernel = self.compile( 2025-05-07T20:33:22.4764516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4764691Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4764828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4764837Z 2025-05-07T20:33:22.4765046Z self = 2025-05-07T20:33:22.4766173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4766685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f2058ea0>} 2025-05-07T20:33:22.4767428Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4767623Z context = 2025-05-07T20:33:22.4767629Z 2025-05-07T20:33:22.4767943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4768279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4768390Z module_map=module_map) 2025-05-07T20:33:22.4768624Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4768734Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.4768812Z E ^ 2025-05-07T20:33:22.4769166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4769181Z 2025-05-07T20:33:22.4769592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4769597Z 2025-05-07T20:33:22.4769705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4769934Z self=, 2025-05-07T20:33:22.4770014Z T=4096, 2025-05-07T20:33:22.4770103Z D=5120, 2025-05-07T20:33:22.4770194Z scale_ub=None, 2025-05-07T20:33:22.4770343Z contiguous=False, 2025-05-07T20:33:22.4770434Z compiled=False, 2025-05-07T20:33:22.4770523Z ) 2025-05-07T20:33:22.4770745Z self = 2025-05-07T20:33:22.4770927Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.4770931Z 2025-05-07T20:33:22.4771012Z @given( 2025-05-07T20:33:22.4771133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4771240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4771356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4771475Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4771596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4771674Z ) 2025-05-07T20:33:22.4771930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4772030Z def test_silu_mul_quant( 2025-05-07T20:33:22.4772112Z self, 2025-05-07T20:33:22.4772199Z T: int, 2025-05-07T20:33:22.4772279Z D: int, 2025-05-07T20:33:22.4772384Z scale_ub: Optional[float], 2025-05-07T20:33:22.4772484Z contiguous: bool, 2025-05-07T20:33:22.4772573Z compiled: bool, 2025-05-07T20:33:22.4772655Z ) -> None: 2025-05-07T20:33:22.4772762Z torch.manual_seed(2025) 2025-05-07T20:33:22.4772836Z 2025-05-07T20:33:22.4773004Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4773089Z 2025-05-07T20:33:22.4773181Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4773307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4773406Z x = x_sign * x_clamp 2025-05-07T20:33:22.4773488Z x0 = x[:, :D] 2025-05-07T20:33:22.4773578Z x1 = x[:, D:] 2025-05-07T20:33:22.4773654Z 2025-05-07T20:33:22.4773744Z if contiguous: 2025-05-07T20:33:22.4773847Z x0 = x0.contiguous() 2025-05-07T20:33:22.4773942Z x1 = x1.contiguous() 2025-05-07T20:33:22.4774018Z 2025-05-07T20:33:22.4774118Z if scale_ub is not None: 2025-05-07T20:33:22.4774228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4774364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4774447Z ) 2025-05-07T20:33:22.4774528Z else: 2025-05-07T20:33:22.4774625Z scale_ub_tensor = None 2025-05-07T20:33:22.4774707Z 2025-05-07T20:33:22.4774837Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4774937Z op = silu_mul_quant 2025-05-07T20:33:22.4775024Z if compiled: 2025-05-07T20:33:22.4775125Z op = torch.compile(op) 2025-05-07T20:33:22.4775243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4775319Z 2025-05-07T20:33:22.4775413Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4775503Z 2025-05-07T20:33:22.4775608Z moe/activation_test.py:117: 2025-05-07T20:33:22.4775744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4775887Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4776000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4776499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.4776603Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4776960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4777184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4777533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4777628Z kernel = self.compile( 2025-05-07T20:33:22.4778053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4778238Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4778370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4778374Z 2025-05-07T20:33:22.4778585Z self = 2025-05-07T20:33:22.4779361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4779918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f317b240>} 2025-05-07T20:33:22.4780664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4780857Z context = 2025-05-07T20:33:22.4780864Z 2025-05-07T20:33:22.4781034Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4781299Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4781413Z module_map=module_map) 2025-05-07T20:33:22.4781574Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4781676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4781763Z E ^ 2025-05-07T20:33:22.4782117Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4782122Z 2025-05-07T20:33:22.4782534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4782548Z 2025-05-07T20:33:22.4782652Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4782877Z self=, 2025-05-07T20:33:22.4782963Z T=4096, 2025-05-07T20:33:22.4783042Z D=7168, 2025-05-07T20:33:22.4783127Z scale_ub=None, 2025-05-07T20:33:22.4783223Z contiguous=False, 2025-05-07T20:33:22.4783307Z compiled=False, 2025-05-07T20:33:22.4783382Z ) 2025-05-07T20:33:22.4783608Z self = 2025-05-07T20:33:22.4783781Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.4783786Z 2025-05-07T20:33:22.4783873Z @given( 2025-05-07T20:33:22.4783993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4784092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4784297Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4784423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4784539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4784661Z ) 2025-05-07T20:33:22.4784905Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4785000Z def test_silu_mul_quant( 2025-05-07T20:33:22.4785086Z self, 2025-05-07T20:33:22.4785165Z T: int, 2025-05-07T20:33:22.4785244Z D: int, 2025-05-07T20:33:22.4785349Z scale_ub: Optional[float], 2025-05-07T20:33:22.4785441Z contiguous: bool, 2025-05-07T20:33:22.4785533Z compiled: bool, 2025-05-07T20:33:22.4785613Z ) -> None: 2025-05-07T20:33:22.4785708Z torch.manual_seed(2025) 2025-05-07T20:33:22.4785788Z 2025-05-07T20:33:22.4785957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4786035Z 2025-05-07T20:33:22.4786140Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4786305Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4786397Z x = x_sign * x_clamp 2025-05-07T20:33:22.4786485Z x0 = x[:, :D] 2025-05-07T20:33:22.4786571Z x1 = x[:, D:] 2025-05-07T20:33:22.4786646Z 2025-05-07T20:33:22.4786738Z if contiguous: 2025-05-07T20:33:22.4786834Z x0 = x0.contiguous() 2025-05-07T20:33:22.4786928Z x1 = x1.contiguous() 2025-05-07T20:33:22.4787010Z 2025-05-07T20:33:22.4787101Z if scale_ub is not None: 2025-05-07T20:33:22.4787212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4787345Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4787423Z ) 2025-05-07T20:33:22.4787505Z else: 2025-05-07T20:33:22.4787602Z scale_ub_tensor = None 2025-05-07T20:33:22.4795015Z 2025-05-07T20:33:22.4795173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4795289Z op = silu_mul_quant 2025-05-07T20:33:22.4795381Z if compiled: 2025-05-07T20:33:22.4795486Z op = torch.compile(op) 2025-05-07T20:33:22.4795602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4795681Z 2025-05-07T20:33:22.4795879Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4795885Z 2025-05-07T20:33:22.4795993Z moe/activation_test.py:117: 2025-05-07T20:33:22.4796124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4796231Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4796341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4796844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.4796948Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4797309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4797540Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4797885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4797983Z kernel = self.compile( 2025-05-07T20:33:22.4798365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4798549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4798677Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4798682Z 2025-05-07T20:33:22.4798893Z self = 2025-05-07T20:33:22.4799786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4800343Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f25c0>} 2025-05-07T20:33:22.4801129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4801323Z context = 2025-05-07T20:33:22.4801328Z 2025-05-07T20:33:22.4801502Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4801771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4801893Z module_map=module_map) 2025-05-07T20:33:22.4802058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4802167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4802294Z E ^ 2025-05-07T20:33:22.4802654Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4802662Z 2025-05-07T20:33:22.4803078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4803089Z 2025-05-07T20:33:22.4803197Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4803424Z self=, 2025-05-07T20:33:22.4803512Z T=128, 2025-05-07T20:33:22.4803593Z D=7168, 2025-05-07T20:33:22.4803682Z scale_ub=None, 2025-05-07T20:33:22.4803783Z contiguous=False, 2025-05-07T20:33:22.4803872Z compiled=True, 2025-05-07T20:33:22.4803953Z ) 2025-05-07T20:33:22.4804183Z self = 2025-05-07T20:33:22.4804369Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.4804375Z 2025-05-07T20:33:22.4804469Z @given( 2025-05-07T20:33:22.4804593Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4804699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4804831Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4804956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4805078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4805166Z ) 2025-05-07T20:33:22.4806352Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4806452Z def test_silu_mul_quant( 2025-05-07T20:33:22.4806545Z self, 2025-05-07T20:33:22.4806629Z T: int, 2025-05-07T20:33:22.4806719Z D: int, 2025-05-07T20:33:22.4806823Z scale_ub: Optional[float], 2025-05-07T20:33:22.4806916Z contiguous: bool, 2025-05-07T20:33:22.4807018Z compiled: bool, 2025-05-07T20:33:22.4807103Z ) -> None: 2025-05-07T20:33:22.4807207Z torch.manual_seed(2025) 2025-05-07T20:33:22.4807296Z 2025-05-07T20:33:22.4807470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4807553Z 2025-05-07T20:33:22.4807659Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4807789Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4807883Z x = x_sign * x_clamp 2025-05-07T20:33:22.4807976Z x0 = x[:, :D] 2025-05-07T20:33:22.4808063Z x1 = x[:, D:] 2025-05-07T20:33:22.4808143Z 2025-05-07T20:33:22.4808239Z if contiguous: 2025-05-07T20:33:22.4808336Z x0 = x0.contiguous() 2025-05-07T20:33:22.4808436Z x1 = x1.contiguous() 2025-05-07T20:33:22.4808519Z 2025-05-07T20:33:22.4808617Z if scale_ub is not None: 2025-05-07T20:33:22.4808734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4808971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4809055Z ) 2025-05-07T20:33:22.4809146Z else: 2025-05-07T20:33:22.4809247Z scale_ub_tensor = None 2025-05-07T20:33:22.4809364Z 2025-05-07T20:33:22.4809510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4809609Z op = silu_mul_quant 2025-05-07T20:33:22.4809703Z if compiled: 2025-05-07T20:33:22.4809812Z op = torch.compile(op) 2025-05-07T20:33:22.4809925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4810010Z 2025-05-07T20:33:22.4810106Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.4810229Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.4810312Z 2025-05-07T20:33:22.4810453Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4810559Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.4810672Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.4810806Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.4810990Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4811081Z 2025-05-07T20:33:22.4811190Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.4811195Z 2025-05-07T20:33:22.4811304Z moe/activation_test.py:126: 2025-05-07T20:33:22.4811439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4811550Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.4811696Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.4812256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.4812361Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.4812732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4812960Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4813339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.4813604Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.4813981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.4814158Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.4814502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.4814594Z fn() 2025-05-07T20:33:22.4814995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.4815082Z self.fn.run( 2025-05-07T20:33:22.4815437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4815537Z kernel = self.compile( 2025-05-07T20:33:22.4815921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4816108Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4816240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4816245Z 2025-05-07T20:33:22.4816463Z self = 2025-05-07T20:33:22.4817244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4817799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f31a0>} 2025-05-07T20:33:22.4818592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4818830Z context = 2025-05-07T20:33:22.4818834Z 2025-05-07T20:33:22.4819011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4819280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4819395Z module_map=module_map) 2025-05-07T20:33:22.4819584Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4819704Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.4819817Z E ^ 2025-05-07T20:33:22.4820221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:22.4820229Z 
2025-05-07T20:33:22.4820649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.4820657Z 
[Hypothesis retries test_silu_mul_quant with fresh parameters; the test source and traceback are identical for every retry, so the next three examples are condensed here. Each ends in the same CompilationError from triton/compiler/compiler.py:100, rooted in ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):
  - T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False -> fn() fails compiling _fbgemm_silu_mul_quant (gen_ai/moe/activation.py:80)
  - T=4096, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> fn() fails compiling _fbgemm_silu_mul_quant
  - T=1,    D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> fn() returns, then ref_fn() fails compiling _kernel_quantize_fp8_row (triton_gemm/fp8_gemm.py:2370)]
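Why every example fails the same way: Triton's fp8e4nv is the e4m3fn float8 format, which NVIDIA GPUs implement natively only from SM 8.9 (Ada) and SM 9.0 (Hopper); the A10G on this linux.g5.4xlarge runner reports SM 8.6, so ast_to_ttir rejects the dtype before any kernel can run. A minimal sketch of the capability probe that predicts these failures (the helper name is hypothetical, not part of the test file):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which NVIDIA
        # hardware supports natively only from SM 8.9 (Ada) / SM 9.0 (Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if torch.cuda.is_available():
        # On this runner's A10G (SM 8.6) this prints "(8, 6) False",
        # consistent with the CompilationError above.
        print(torch.cuda.get_device_capability(), supports_fp8e4nv())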
[Four more examples — T=2048, T=128, T=4096, and T=16384, each with D=5120, scale_ub=None, contiguous=True, compiled=True — repeat the same pattern: fn() returns, then ref_fn() fails in triton_quantize_fp8_row while compiling _kernel_quantize_fp8_row, with the identical CompilationError raised from triton/compiler/compiler.py:100.]
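For context on the ref_fn() path: triton_quantize_fp8_row performs per-row absmax quantization into fp8 with an optional upper bound on the row scale. Below is a pure-PyTorch sketch of that computation, under the assumption that dequantization is y_fp8.to(torch.float32) * y_scale[:, None] as in the test; the function name and clamping details are illustrative, the real kernel being _kernel_quantize_fp8_row in triton_gemm/fp8_gemm.py:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax scale, optionally capped by scale_ub, mapping each
        # row into the representable range of float8_e4m3fn (max 448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale.squeeze(-1)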
[Three further examples fail identically:
  - T=1, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> fn() fails under torch.compile (torch/_dynamo/eval_frame.py:678) while compiling _fbgemm_silu_mul_quant
  - T=1, D=5120, scale_ub=None,   contiguous=False, compiled=True  -> fn() returns, then ref_fn() fails compiling _kernel_quantize_fp8_row
  - T=1, D=5120, scale_ub=None,   contiguous=True,  compiled=False -> fn() fails compiling _fbgemm_silu_mul_quant
In every case the root error is ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4982460Z 2025-05-07T20:33:22.4982871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4982875Z 2025-05-07T20:33:22.4982978Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4983217Z self=, 2025-05-07T20:33:22.4983303Z T=128, 2025-05-07T20:33:22.4983383Z D=5120, 2025-05-07T20:33:22.4983473Z scale_ub=None, 2025-05-07T20:33:22.4983566Z contiguous=False, 2025-05-07T20:33:22.4983651Z compiled=True, 2025-05-07T20:33:22.4983732Z ) 2025-05-07T20:33:22.4983953Z self = 2025-05-07T20:33:22.4984129Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.4984133Z 2025-05-07T20:33:22.4984213Z @given( 2025-05-07T20:33:22.4984333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4984442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4984559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4984677Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4984798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4984884Z ) 2025-05-07T20:33:22.4985133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4985234Z def test_silu_mul_quant( 2025-05-07T20:33:22.4985319Z self, 2025-05-07T20:33:22.4985405Z T: int, 2025-05-07T20:33:22.4985484Z D: int, 2025-05-07T20:33:22.4985586Z scale_ub: Optional[float], 2025-05-07T20:33:22.4985683Z contiguous: bool, 2025-05-07T20:33:22.4985771Z compiled: bool, 2025-05-07T20:33:22.4985852Z ) -> None: 2025-05-07T20:33:22.4985953Z torch.manual_seed(2025) 2025-05-07T20:33:22.4986031Z 2025-05-07T20:33:22.4986199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4986281Z 2025-05-07T20:33:22.4986376Z x_sign = torch.sign(x) 2025-05-07T20:33:22.4986503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.4986600Z x = x_sign * x_clamp 2025-05-07T20:33:22.4986682Z x0 = x[:, :D] 2025-05-07T20:33:22.4986864Z x1 = x[:, D:] 2025-05-07T20:33:22.4986940Z 2025-05-07T20:33:22.4987030Z if contiguous: 2025-05-07T20:33:22.4987128Z x0 = x0.contiguous() 2025-05-07T20:33:22.4987263Z x1 = x1.contiguous() 2025-05-07T20:33:22.4987338Z 2025-05-07T20:33:22.4987436Z if scale_ub is not None: 2025-05-07T20:33:22.4987543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.4987681Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.4987764Z ) 2025-05-07T20:33:22.4987843Z else: 2025-05-07T20:33:22.4987939Z scale_ub_tensor = None 2025-05-07T20:33:22.4988017Z 2025-05-07T20:33:22.4988146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.4988244Z op = silu_mul_quant 2025-05-07T20:33:22.4988330Z if compiled: 2025-05-07T20:33:22.4988430Z op = torch.compile(op) 2025-05-07T20:33:22.4988542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4988618Z 2025-05-07T20:33:22.4988749Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.4988754Z 2025-05-07T20:33:22.4988858Z moe/activation_test.py:117: 2025-05-07T20:33:22.4988992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4989096Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.4989200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.4989566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.4989676Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.4990205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.4990302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.4990663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.4990893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.4991229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.4991330Z kernel = self.compile( 2025-05-07T20:33:22.4991709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.4991891Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.4992020Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.4992024Z 2025-05-07T20:33:22.4992231Z self = 2025-05-07T20:33:22.4993016Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.4993523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f01a3920>} 2025-05-07T20:33:22.4994276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.4994468Z context = 2025-05-07T20:33:22.4994472Z 2025-05-07T20:33:22.4994638Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.4994906Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.4995014Z module_map=module_map) 2025-05-07T20:33:22.4995181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.4995366Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.4995448Z E ^ 2025-05-07T20:33:22.4995915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.4995995Z 2025-05-07T20:33:22.4996410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.4996415Z 2025-05-07T20:33:22.4996525Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.4996750Z self=, 2025-05-07T20:33:22.4996832Z T=128, 2025-05-07T20:33:22.4996921Z D=7168, 2025-05-07T20:33:22.4997007Z scale_ub=1200.0, 2025-05-07T20:33:22.4997097Z contiguous=False, 2025-05-07T20:33:22.4997189Z compiled=False, 2025-05-07T20:33:22.4997265Z ) 2025-05-07T20:33:22.4997485Z self = 2025-05-07T20:33:22.4997668Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.4997675Z 2025-05-07T20:33:22.4997800Z @given( 2025-05-07T20:33:22.4997925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.4998034Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.4998151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.4998274Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.4998390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.4998468Z ) 2025-05-07T20:33:22.4998721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.4998816Z def test_silu_mul_quant( 2025-05-07T20:33:22.4998898Z self, 2025-05-07T20:33:22.4998981Z T: int, 2025-05-07T20:33:22.4999062Z D: int, 2025-05-07T20:33:22.4999163Z scale_ub: Optional[float], 2025-05-07T20:33:22.4999260Z contiguous: bool, 2025-05-07T20:33:22.4999351Z compiled: bool, 2025-05-07T20:33:22.4999439Z ) -> None: 2025-05-07T20:33:22.4999538Z torch.manual_seed(2025) 2025-05-07T20:33:22.4999612Z 2025-05-07T20:33:22.4999792Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.4999889Z 2025-05-07T20:33:22.4999997Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5000140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5000231Z x = x_sign * x_clamp 2025-05-07T20:33:22.5000312Z x0 = x[:, :D] 2025-05-07T20:33:22.5000399Z x1 = x[:, D:] 2025-05-07T20:33:22.5000473Z 2025-05-07T20:33:22.5000558Z if contiguous: 2025-05-07T20:33:22.5000657Z x0 = x0.contiguous() 2025-05-07T20:33:22.5000746Z x1 = x1.contiguous() 2025-05-07T20:33:22.5000818Z 2025-05-07T20:33:22.5000920Z if scale_ub is not None: 2025-05-07T20:33:22.5001027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5001169Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5001247Z ) 2025-05-07T20:33:22.5001329Z else: 2025-05-07T20:33:22.5001429Z scale_ub_tensor = None 2025-05-07T20:33:22.5001506Z 2025-05-07T20:33:22.5001636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5001732Z op = silu_mul_quant 2025-05-07T20:33:22.5001818Z if compiled: 2025-05-07T20:33:22.5001917Z op = torch.compile(op) 2025-05-07T20:33:22.5002028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5002102Z 2025-05-07T20:33:22.5002193Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5002203Z 2025-05-07T20:33:22.5002300Z moe/activation_test.py:117: 2025-05-07T20:33:22.5002433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5002538Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5002644Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5003240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5003354Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5003876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5004166Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5004507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5004603Z kernel = self.compile( 2025-05-07T20:33:22.5004987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5005161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5005289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5005298Z 2025-05-07T20:33:22.5005561Z self = 2025-05-07T20:33:22.5006339Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5006852Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f16f16c0>} 2025-05-07T20:33:22.5007596Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5007792Z context = 2025-05-07T20:33:22.5007796Z 2025-05-07T20:33:22.5007965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5008232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5008347Z module_map=module_map) 2025-05-07T20:33:22.5008510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5008612Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5008695Z E ^ 2025-05-07T20:33:22.5009048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5009053Z 2025-05-07T20:33:22.5009469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5009474Z 2025-05-07T20:33:22.5009579Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5009804Z self=, 2025-05-07T20:33:22.5009894Z T=128, 2025-05-07T20:33:22.5009982Z D=5120, 2025-05-07T20:33:22.5010069Z scale_ub=None, 2025-05-07T20:33:22.5010164Z contiguous=False, 2025-05-07T20:33:22.5010250Z compiled=False, 2025-05-07T20:33:22.5010334Z ) 2025-05-07T20:33:22.5010555Z self = 2025-05-07T20:33:22.5010727Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.5010732Z 2025-05-07T20:33:22.5010818Z @given( 2025-05-07T20:33:22.5010939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5011040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5011160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5011277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5011391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5011473Z ) 2025-05-07T20:33:22.5011714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5011899Z def test_silu_mul_quant( 2025-05-07T20:33:22.5011979Z self, 2025-05-07T20:33:22.5012061Z T: int, 2025-05-07T20:33:22.5012147Z D: int, 2025-05-07T20:33:22.5012285Z scale_ub: Optional[float], 2025-05-07T20:33:22.5012377Z contiguous: bool, 2025-05-07T20:33:22.5012469Z compiled: bool, 2025-05-07T20:33:22.5012551Z ) -> None: 2025-05-07T20:33:22.5012646Z torch.manual_seed(2025) 2025-05-07T20:33:22.5012726Z 2025-05-07T20:33:22.5012894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5012970Z 2025-05-07T20:33:22.5013067Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5013193Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5013289Z x = x_sign * x_clamp 2025-05-07T20:33:22.5013373Z x0 = x[:, :D] 2025-05-07T20:33:22.5013456Z x1 = x[:, D:] 2025-05-07T20:33:22.5013536Z 2025-05-07T20:33:22.5013622Z if contiguous: 2025-05-07T20:33:22.5013722Z x0 = x0.contiguous() 2025-05-07T20:33:22.5013856Z x1 = x1.contiguous() 2025-05-07T20:33:22.5013936Z 2025-05-07T20:33:22.5014030Z if scale_ub is not None: 2025-05-07T20:33:22.5014146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5014279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5014355Z ) 2025-05-07T20:33:22.5014436Z else: 2025-05-07T20:33:22.5014531Z scale_ub_tensor = None 2025-05-07T20:33:22.5014605Z 2025-05-07T20:33:22.5014742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5014831Z op = silu_mul_quant 2025-05-07T20:33:22.5014924Z if compiled: 2025-05-07T20:33:22.5015023Z op = torch.compile(op) 2025-05-07T20:33:22.5015129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5015208Z 2025-05-07T20:33:22.5015299Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5015309Z 2025-05-07T20:33:22.5015406Z moe/activation_test.py:117: 2025-05-07T20:33:22.5015546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5015646Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5015750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5016248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5016346Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5016708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5016929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5017268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5017367Z kernel = self.compile( 2025-05-07T20:33:22.5017753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5017933Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5018063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5018068Z 2025-05-07T20:33:22.5018271Z self = 2025-05-07T20:33:22.5019052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5019559Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cfd3fba0>} 2025-05-07T20:33:22.5020404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5020634Z context = 2025-05-07T20:33:22.5020675Z 2025-05-07T20:33:22.5020838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5021110Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5021222Z module_map=module_map) 2025-05-07T20:33:22.5021385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5021486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5021563Z E ^ 2025-05-07T20:33:22.5021922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5021926Z 2025-05-07T20:33:22.5022339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5022386Z 2025-05-07T20:33:22.5022498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5022721Z self=, 2025-05-07T20:33:22.5022805Z T=128, 2025-05-07T20:33:22.5022891Z D=5120, 2025-05-07T20:33:22.5022978Z scale_ub=1200.0, 2025-05-07T20:33:22.5023064Z contiguous=True, 2025-05-07T20:33:22.5023154Z compiled=False, 2025-05-07T20:33:22.5023231Z ) 2025-05-07T20:33:22.5023449Z self = 2025-05-07T20:33:22.5023624Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5023629Z 2025-05-07T20:33:22.5023707Z @given( 2025-05-07T20:33:22.5023835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5023935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5024055Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5024184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5024300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5024381Z ) 2025-05-07T20:33:22.5024630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5024726Z def test_silu_mul_quant( 2025-05-07T20:33:22.5024808Z self, 2025-05-07T20:33:22.5024892Z T: int, 2025-05-07T20:33:22.5024971Z D: int, 2025-05-07T20:33:22.5025074Z scale_ub: Optional[float], 2025-05-07T20:33:22.5025165Z contiguous: bool, 2025-05-07T20:33:22.5025253Z compiled: bool, 2025-05-07T20:33:22.5025337Z ) -> None: 2025-05-07T20:33:22.5025434Z torch.manual_seed(2025) 2025-05-07T20:33:22.5025507Z 2025-05-07T20:33:22.5025680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5025754Z 2025-05-07T20:33:22.5025850Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5025984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5026075Z x = x_sign * x_clamp 2025-05-07T20:33:22.5026158Z x0 = x[:, :D] 2025-05-07T20:33:22.5026251Z x1 = x[:, D:] 2025-05-07T20:33:22.5026326Z 2025-05-07T20:33:22.5026411Z if contiguous: 2025-05-07T20:33:22.5026510Z x0 = x0.contiguous() 2025-05-07T20:33:22.5026602Z x1 = x1.contiguous() 2025-05-07T20:33:22.5026683Z 2025-05-07T20:33:22.5026775Z if scale_ub is not None: 2025-05-07T20:33:22.5026883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5027022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5027101Z ) 2025-05-07T20:33:22.5027180Z else: 2025-05-07T20:33:22.5027282Z scale_ub_tensor = None 2025-05-07T20:33:22.5027356Z 2025-05-07T20:33:22.5027485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5027689Z op = silu_mul_quant 2025-05-07T20:33:22.5027777Z if compiled: 2025-05-07T20:33:22.5027879Z op = torch.compile(op) 2025-05-07T20:33:22.5027990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5028107Z 2025-05-07T20:33:22.5028201Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5028206Z 2025-05-07T20:33:22.5028303Z moe/activation_test.py:117: 2025-05-07T20:33:22.5028433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5028537Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5028636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5029131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5029231Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5029587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5029883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5030247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5030345Z kernel = self.compile( 2025-05-07T20:33:22.5030731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5030905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5031031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5031040Z 2025-05-07T20:33:22.5031242Z self = 2025-05-07T20:33:22.5032017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5032527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078cb80>} 2025-05-07T20:33:22.5033271Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5033464Z context = 2025-05-07T20:33:22.5033468Z 2025-05-07T20:33:22.5033633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5033893Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5034006Z module_map=module_map) 2025-05-07T20:33:22.5034167Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5034278Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5034356Z E ^ 2025-05-07T20:33:22.5034713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5034720Z 2025-05-07T20:33:22.5035135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5035140Z 2025-05-07T20:33:22.5035243Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5035465Z self=, 2025-05-07T20:33:22.5035551Z T=1, 2025-05-07T20:33:22.5035630Z D=7168, 2025-05-07T20:33:22.5035790Z scale_ub=1200.0, 2025-05-07T20:33:22.5035878Z contiguous=True, 2025-05-07T20:33:22.5035962Z compiled=True, 2025-05-07T20:33:22.5036039Z ) 2025-05-07T20:33:22.5036259Z self = 2025-05-07T20:33:22.5036526Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5036534Z 2025-05-07T20:33:22.5036617Z @given( 2025-05-07T20:33:22.5036736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5036875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5036993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5037110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5037228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5037301Z ) 2025-05-07T20:33:22.5037545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5037644Z def test_silu_mul_quant( 2025-05-07T20:33:22.5037722Z self, 2025-05-07T20:33:22.5037801Z T: int, 2025-05-07T20:33:22.5037885Z D: int, 2025-05-07T20:33:22.5037985Z scale_ub: Optional[float], 2025-05-07T20:33:22.5038076Z contiguous: bool, 2025-05-07T20:33:22.5038174Z compiled: bool, 2025-05-07T20:33:22.5038253Z ) -> None: 2025-05-07T20:33:22.5038388Z torch.manual_seed(2025) 2025-05-07T20:33:22.5038468Z 2025-05-07T20:33:22.5038636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5038719Z 2025-05-07T20:33:22.5038812Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5038938Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5039030Z x = x_sign * x_clamp 2025-05-07T20:33:22.5039111Z x0 = x[:, :D] 2025-05-07T20:33:22.5039191Z x1 = x[:, D:] 2025-05-07T20:33:22.5039268Z 2025-05-07T20:33:22.5039354Z if contiguous: 2025-05-07T20:33:22.5039450Z x0 = x0.contiguous() 2025-05-07T20:33:22.5039546Z x1 = x1.contiguous() 2025-05-07T20:33:22.5039619Z 2025-05-07T20:33:22.5039712Z if scale_ub is not None: 2025-05-07T20:33:22.5039822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5039958Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5040039Z ) 2025-05-07T20:33:22.5040129Z else: 2025-05-07T20:33:22.5040227Z scale_ub_tensor = None 2025-05-07T20:33:22.5040307Z 2025-05-07T20:33:22.5040436Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5040527Z op = silu_mul_quant 2025-05-07T20:33:22.5040617Z if compiled: 2025-05-07T20:33:22.5040716Z op = torch.compile(op) 2025-05-07T20:33:22.5040821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5040895Z 2025-05-07T20:33:22.5040985Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5040989Z 2025-05-07T20:33:22.5041086Z moe/activation_test.py:117: 2025-05-07T20:33:22.5041222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5041325Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5041429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5041801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5041894Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5042385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5042485Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5042842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5043067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5043405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5043502Z kernel = self.compile( 2025-05-07T20:33:22.5043881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5044139Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5044277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5044320Z 2025-05-07T20:33:22.5044525Z self = 2025-05-07T20:33:22.5045305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5045806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078e2a0>} 2025-05-07T20:33:22.5046552Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5046788Z context = 2025-05-07T20:33:22.5046792Z 2025-05-07T20:33:22.5046956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5047223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5047331Z module_map=module_map) 2025-05-07T20:33:22.5047494Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5047598Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5047676Z E ^ 2025-05-07T20:33:22.5048036Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5048040Z 2025-05-07T20:33:22.5048452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5048457Z 2025-05-07T20:33:22.5048564Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5048795Z self=, 2025-05-07T20:33:22.5048875Z T=1, 2025-05-07T20:33:22.5048957Z D=7168, 2025-05-07T20:33:22.5049043Z scale_ub=1200.0, 2025-05-07T20:33:22.5049130Z contiguous=False, 2025-05-07T20:33:22.5049221Z compiled=True, 2025-05-07T20:33:22.5049294Z ) 2025-05-07T20:33:22.5049511Z self = 2025-05-07T20:33:22.5049712Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5049717Z 2025-05-07T20:33:22.5049814Z @given( 2025-05-07T20:33:22.5049935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5050041Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5050156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5050273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5050396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5050472Z ) 2025-05-07T20:33:22.5050721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5050817Z def test_silu_mul_quant( 2025-05-07T20:33:22.5050893Z self, 2025-05-07T20:33:22.5050973Z T: int, 2025-05-07T20:33:22.5051050Z D: int, 2025-05-07T20:33:22.5051148Z scale_ub: Optional[float], 2025-05-07T20:33:22.5051240Z contiguous: bool, 2025-05-07T20:33:22.5051327Z compiled: bool, 2025-05-07T20:33:22.5051405Z ) -> None: 2025-05-07T20:33:22.5051504Z torch.manual_seed(2025) 2025-05-07T20:33:22.5051576Z 2025-05-07T20:33:22.5051742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5051821Z 2025-05-07T20:33:22.5051912Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5052039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5052178Z x = x_sign * x_clamp 2025-05-07T20:33:22.5052298Z x0 = x[:, :D] 2025-05-07T20:33:22.5052385Z x1 = x[:, D:] 2025-05-07T20:33:22.5052462Z 2025-05-07T20:33:22.5052552Z if contiguous: 2025-05-07T20:33:22.5052687Z x0 = x0.contiguous() 2025-05-07T20:33:22.5052777Z x1 = x1.contiguous() 2025-05-07T20:33:22.5056267Z 2025-05-07T20:33:22.5056378Z if scale_ub is not None: 2025-05-07T20:33:22.5056495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5056636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5056713Z ) 2025-05-07T20:33:22.5056799Z else: 2025-05-07T20:33:22.5056895Z scale_ub_tensor = None 2025-05-07T20:33:22.5056969Z 2025-05-07T20:33:22.5057106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5057199Z op = silu_mul_quant 2025-05-07T20:33:22.5057285Z if compiled: 2025-05-07T20:33:22.5057399Z op = torch.compile(op) 2025-05-07T20:33:22.5057509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5057649Z 2025-05-07T20:33:22.5057747Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5057756Z 2025-05-07T20:33:22.5057852Z moe/activation_test.py:117: 2025-05-07T20:33:22.5057985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5058086Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5058186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5058562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5058655Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5059146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5059249Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5059609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5059866Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5060229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5060324Z kernel = self.compile( 2025-05-07T20:33:22.5060707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5060880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5061012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5061017Z 2025-05-07T20:33:22.5061222Z self = 2025-05-07T20:33:22.5062002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5062511Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f33f078f9c0>} 2025-05-07T20:33:22.5063257Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5063451Z context = 2025-05-07T20:33:22.5063455Z 2025-05-07T20:33:22.5063618Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5063880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5063989Z module_map=module_map) 2025-05-07T20:33:22.5064263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5064375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5064452Z E ^ 2025-05-07T20:33:22.5064808Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5064852Z 2025-05-07T20:33:22.5065267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5065272Z 2025-05-07T20:33:22.5065582Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5065893Z self=, 2025-05-07T20:33:22.5065976Z T=1, 2025-05-07T20:33:22.5066056Z D=7168, 2025-05-07T20:33:22.5066142Z scale_ub=None, 2025-05-07T20:33:22.5066230Z contiguous=False, 2025-05-07T20:33:22.5066314Z compiled=True, 2025-05-07T20:33:22.5066392Z ) 2025-05-07T20:33:22.5066613Z self = 2025-05-07T20:33:22.5066865Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5066871Z 2025-05-07T20:33:22.5066957Z @given( 2025-05-07T20:33:22.5067083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5067182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5067301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5067418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5067533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5067606Z ) 2025-05-07T20:33:22.5067849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5067945Z def test_silu_mul_quant( 2025-05-07T20:33:22.5068021Z self, 2025-05-07T20:33:22.5068099Z T: int, 2025-05-07T20:33:22.5068182Z D: int, 2025-05-07T20:33:22.5068280Z scale_ub: Optional[float], 2025-05-07T20:33:22.5068373Z contiguous: bool, 2025-05-07T20:33:22.5068464Z compiled: bool, 2025-05-07T20:33:22.5068550Z ) -> None: 2025-05-07T20:33:22.5068643Z torch.manual_seed(2025) 2025-05-07T20:33:22.5068721Z 2025-05-07T20:33:22.5068890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5068969Z 2025-05-07T20:33:22.5069059Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5069184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5069279Z x = x_sign * x_clamp 2025-05-07T20:33:22.5069358Z x0 = x[:, :D] 2025-05-07T20:33:22.5069442Z x1 = x[:, D:] 2025-05-07T20:33:22.5069518Z 2025-05-07T20:33:22.5069602Z if contiguous: 2025-05-07T20:33:22.5069694Z x0 = x0.contiguous() 2025-05-07T20:33:22.5069788Z x1 = x1.contiguous() 2025-05-07T20:33:22.5069861Z 2025-05-07T20:33:22.5069951Z if scale_ub is not None: 2025-05-07T20:33:22.5070062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5070199Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5070278Z ) 2025-05-07T20:33:22.5070354Z else: 2025-05-07T20:33:22.5070449Z scale_ub_tensor = None 2025-05-07T20:33:22.5070525Z 2025-05-07T20:33:22.5070653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5070743Z op = silu_mul_quant 2025-05-07T20:33:22.5070834Z if compiled: 2025-05-07T20:33:22.5070934Z op = torch.compile(op) 2025-05-07T20:33:22.5071039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5071117Z 2025-05-07T20:33:22.5071206Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.5071326Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.5071408Z 2025-05-07T20:33:22.5071540Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5071645Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.5071869Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.5072000Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.5072141Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.5072274Z 2025-05-07T20:33:22.5072376Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.5072381Z 2025-05-07T20:33:22.5072482Z moe/activation_test.py:126: 2025-05-07T20:33:22.5072611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5072718Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.5072854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.5073410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.5073515Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.5073875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5074141Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5074513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.5074772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.5075150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.5075320Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.5075660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.5075800Z fn() 2025-05-07T20:33:22.5076201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.5076290Z self.fn.run( 2025-05-07T20:33:22.5076635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5076728Z kernel = self.compile( 2025-05-07T20:33:22.5077114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5077288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5077417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5077421Z 2025-05-07T20:33:22.5077628Z self = 2025-05-07T20:33:22.5078403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5078913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5cb80>} 2025-05-07T20:33:22.5079658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5079850Z context = 2025-05-07T20:33:22.5079860Z 2025-05-07T20:33:22.5080022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5080284Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5080396Z module_map=module_map) 2025-05-07T20:33:22.5080559Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5080662Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.5080746Z E ^ 2025-05-07T20:33:22.5081190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5081195Z 2025-05-07T20:33:22.5081612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5081655Z 2025-05-07T20:33:22.5081758Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5081979Z self=, 2025-05-07T20:33:22.5082063Z T=1, 2025-05-07T20:33:22.5082140Z D=5120, 2025-05-07T20:33:22.5082223Z scale_ub=1200.0, 2025-05-07T20:33:22.5082312Z contiguous=False, 2025-05-07T20:33:22.5082395Z compiled=True, 2025-05-07T20:33:22.5082468Z ) 2025-05-07T20:33:22.5082687Z self = 2025-05-07T20:33:22.5082853Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5082860Z 2025-05-07T20:33:22.5082944Z @given( 2025-05-07T20:33:22.5083101Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5083202Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5083323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5083440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5083553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5083632Z ) 2025-05-07T20:33:22.5083873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5083970Z def test_silu_mul_quant( 2025-05-07T20:33:22.5084046Z self, 2025-05-07T20:33:22.5084124Z T: int, 2025-05-07T20:33:22.5084205Z D: int, 2025-05-07T20:33:22.5084304Z scale_ub: Optional[float], 2025-05-07T20:33:22.5084391Z contiguous: bool, 2025-05-07T20:33:22.5084479Z compiled: bool, 2025-05-07T20:33:22.5084558Z ) -> None: 2025-05-07T20:33:22.5084659Z torch.manual_seed(2025) 2025-05-07T20:33:22.5084737Z 2025-05-07T20:33:22.5084912Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5084987Z 2025-05-07T20:33:22.5085085Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5085209Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5085296Z x = x_sign * x_clamp 2025-05-07T20:33:22.5085382Z x0 = x[:, :D] 2025-05-07T20:33:22.5085462Z x1 = x[:, D:] 2025-05-07T20:33:22.5085538Z 2025-05-07T20:33:22.5085622Z if contiguous: 2025-05-07T20:33:22.5085713Z x0 = x0.contiguous() 2025-05-07T20:33:22.5085803Z x1 = x1.contiguous() 2025-05-07T20:33:22.5085876Z 2025-05-07T20:33:22.5085967Z if scale_ub is not None: 2025-05-07T20:33:22.5086077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5086211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5086285Z ) 2025-05-07T20:33:22.5086371Z else: 2025-05-07T20:33:22.5086466Z scale_ub_tensor = None 2025-05-07T20:33:22.5086541Z 2025-05-07T20:33:22.5086674Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5086767Z op = silu_mul_quant 2025-05-07T20:33:22.5086855Z if compiled: 2025-05-07T20:33:22.5086953Z op = torch.compile(op) 2025-05-07T20:33:22.5087057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5087132Z 2025-05-07T20:33:22.5087224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5087229Z 2025-05-07T20:33:22.5087324Z moe/activation_test.py:117: 2025-05-07T20:33:22.5087459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5087558Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5087658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5088074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5088203Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5088699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5088838Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5089192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5089416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5089752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5089846Z kernel = self.compile( 2025-05-07T20:33:22.5090228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5090399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5090534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5090610Z 2025-05-07T20:33:22.5090815Z self = 2025-05-07T20:33:22.5091593Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5092101Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cff5de40>} 2025-05-07T20:33:22.5092844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5093039Z context = 2025-05-07T20:33:22.5093046Z 2025-05-07T20:33:22.5093211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5093474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5093582Z module_map=module_map) 2025-05-07T20:33:22.5093742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5093849Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5093926Z E ^ 2025-05-07T20:33:22.5094280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5094285Z 2025-05-07T20:33:22.5094698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5094703Z 2025-05-07T20:33:22.5094803Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5095036Z self=, 2025-05-07T20:33:22.5095120Z T=1, 2025-05-07T20:33:22.5095199Z D=5120, 2025-05-07T20:33:22.5095284Z scale_ub=1200.0, 2025-05-07T20:33:22.5095372Z contiguous=False, 2025-05-07T20:33:22.5095458Z compiled=False, 2025-05-07T20:33:22.5095533Z ) 2025-05-07T20:33:22.5095754Z self = 2025-05-07T20:33:22.5095920Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5095927Z 2025-05-07T20:33:22.5096008Z @given( 2025-05-07T20:33:22.5096126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5096225Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5096344Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5096460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5096577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5096662Z ) 2025-05-07T20:33:22.5096994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5097093Z def test_silu_mul_quant( 2025-05-07T20:33:22.5097173Z self, 2025-05-07T20:33:22.5097292Z T: int, 2025-05-07T20:33:22.5097370Z D: int, 2025-05-07T20:33:22.5097472Z scale_ub: Optional[float], 2025-05-07T20:33:22.5097561Z contiguous: bool, 2025-05-07T20:33:22.5097649Z compiled: bool, 2025-05-07T20:33:22.5097727Z ) -> None: 2025-05-07T20:33:22.5097821Z torch.manual_seed(2025) 2025-05-07T20:33:22.5097896Z 2025-05-07T20:33:22.5098063Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5098136Z 2025-05-07T20:33:22.5098234Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5098358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5098446Z x = x_sign * x_clamp 2025-05-07T20:33:22.5098530Z x0 = x[:, :D] 2025-05-07T20:33:22.5098616Z x1 = x[:, D:] 2025-05-07T20:33:22.5098688Z 2025-05-07T20:33:22.5098818Z if contiguous: 2025-05-07T20:33:22.5098911Z x0 = x0.contiguous() 2025-05-07T20:33:22.5099002Z x1 = x1.contiguous() 2025-05-07T20:33:22.5099077Z 2025-05-07T20:33:22.5099167Z if scale_ub is not None: 2025-05-07T20:33:22.5099275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5099409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5099486Z ) 2025-05-07T20:33:22.5099574Z else: 2025-05-07T20:33:22.5099683Z scale_ub_tensor = None 2025-05-07T20:33:22.5099771Z 2025-05-07T20:33:22.5099918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5100007Z op = silu_mul_quant 2025-05-07T20:33:22.5100093Z if compiled: 2025-05-07T20:33:22.5100195Z op = torch.compile(op) 2025-05-07T20:33:22.5100297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5100378Z 2025-05-07T20:33:22.5100468Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5100477Z 2025-05-07T20:33:22.5100574Z moe/activation_test.py:117: 2025-05-07T20:33:22.5100704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5100810Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5100907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5101404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5101500Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:22.5101855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:22.5102077Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:22.5102417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:22.5102516Z     kernel = self.compile(
2025-05-07T20:33:22.5102896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:22.5103070Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:22.5103201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:22.5103206Z
2025-05-07T20:33:22.5103408Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:22.5104187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:22.5104736Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f32cff5eac0>}
2025-05-07T20:33:22.5105524Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:22.5105752Z context = <...>
2025-05-07T20:33:22.5105757Z
2025-05-07T20:33:22.5105919Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:22.5106181Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:22.5106289Z                            module_map=module_map)
2025-05-07T20:33:22.5106449Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:22.5106555Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:22.5106632Z E       ^
2025-05-07T20:33:22.5106987Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:22.5106996Z
2025-05-07T20:33:22.5107449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.5107456Z
2025-05-07T20:33:22.5107559Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:22.5107783Z     self=<...>,
2025-05-07T20:33:22.5107862Z     T=16384,
2025-05-07T20:33:22.5107941Z     D=5120,
2025-05-07T20:33:22.5108024Z     scale_ub=1200.0,
2025-05-07T20:33:22.5108113Z     contiguous=False,
2025-05-07T20:33:22.5108199Z     compiled=True,
2025-05-07T20:33:22.5108272Z )
2025-05-07T20:33:22.5108487Z self = <...>
2025-05-07T20:33:22.5108666Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:33:22.5108671Z
2025-05-07T20:33:22.5108748Z     @given(
2025-05-07T20:33:22.5108866Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:22.5108974Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:22.5109091Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:22.5109209Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:22.5109326Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:22.5109401Z     )
2025-05-07T20:33:22.5109649Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:22.5109746Z     def test_silu_mul_quant(
2025-05-07T20:33:22.5109841Z         self,
2025-05-07T20:33:22.5109929Z         T: int,
2025-05-07T20:33:22.5110023Z         D: int,
2025-05-07T20:33:22.5110122Z         scale_ub: Optional[float],
2025-05-07T20:33:22.5110213Z         contiguous: bool,
2025-05-07T20:33:22.5110298Z         compiled: bool,
2025-05-07T20:33:22.5110375Z     ) -> None:
2025-05-07T20:33:22.5110471Z         torch.manual_seed(2025)
2025-05-07T20:33:22.5110544Z
2025-05-07T20:33:22.5110714Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:22.5110794Z
2025-05-07T20:33:22.5110888Z         x_sign = torch.sign(x)
2025-05-07T20:33:22.5111015Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:22.5111105Z         x = x_sign * x_clamp
2025-05-07T20:33:22.5111185Z         x0 = x[:, :D]
2025-05-07T20:33:22.5111267Z         x1 = x[:, D:]
2025-05-07T20:33:22.5111338Z
2025-05-07T20:33:22.5111422Z         if contiguous:
2025-05-07T20:33:22.5111515Z             x0 = x0.contiguous()
2025-05-07T20:33:22.5111603Z             x1 = x1.contiguous()
2025-05-07T20:33:22.5111674Z
2025-05-07T20:33:22.5111768Z         if scale_ub is not None:
2025-05-07T20:33:22.5111873Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:22.5112004Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:22.5112084Z             )
2025-05-07T20:33:22.5112161Z         else:
2025-05-07T20:33:22.5112258Z             scale_ub_tensor = None
2025-05-07T20:33:22.5112381Z
2025-05-07T20:33:22.5112549Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:22.5112645Z             op = silu_mul_quant
2025-05-07T20:33:22.5112729Z             if compiled:
2025-05-07T20:33:22.5112869Z                 op = torch.compile(op)
2025-05-07T20:33:22.5112977Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:22.5113049Z
2025-05-07T20:33:22.5113141Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:22.5113145Z
2025-05-07T20:33:22.5113245Z moe/activation_test.py:117:
2025-05-07T20:33:22.5113375Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:22.5113478Z moe/activation_test.py:115: in fn
2025-05-07T20:33:22.5113579Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:22.5113942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:22.5114036Z     return fn(*args, **kwargs)
2025-05-07T20:33:22.5114564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:22.5114664Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:22.5115023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:22.5115246Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:22.5115584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:22.5115679Z     kernel = self.compile(
2025-05-07T20:33:22.5116157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:22.5116335Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:22.5116461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:22.5116465Z
2025-05-07T20:33:22.5116674Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:22.5117459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:22.5117963Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f32cf910180>}
2025-05-07T20:33:22.5118708Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:22.5118898Z context = <...>
2025-05-07T20:33:22.5118903Z
2025-05-07T20:33:22.5119070Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:22.5119337Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:22.5119443Z                            module_map=module_map)
2025-05-07T20:33:22.5119607Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:22.5119705Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:22.5119781Z E       ^
2025-05-07T20:33:22.5120138Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:22.5120142Z
2025-05-07T20:33:22.5120551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.5120555Z
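What the failing test exercises: silu_mul_quant fuses a SiLU-gated multiply (SiLU(x0) * x1) with quantization to FP8, returning the quantized tensor and its scale. The sketch below is a minimal eager-mode reference of those semantics as inferred from the test body above; the function name silu_mul_quant_reference, the per-tensor amax scaling scheme, and the use of torch.float8_e4m3fn (the PyTorch dtype corresponding to Triton's fp8e4nv) are assumptions for illustration, not FBGEMM's actual implementation.

# Hedged reference sketch for the op under test (assumptions noted above).
from typing import Optional, Tuple

import torch

def silu_mul_quant_reference(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Compute SiLU(x0) * x1 in fp32 to avoid bf16 rounding in the reference.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Per-tensor absolute max, optionally bounded from above by scale_ub.
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(y.dtype))
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    scale = fp8_max / amax.clamp(min=1e-12)
    y_fp8 = (y * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Return the dequantization scale, matching the (y_fp8, y_scale) unpacking above.
    return y_fp8, scale.reciprocal()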
[11 further Hypothesis examples produced the identical traceback and CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the sampled parameters differed:]
2025-05-07T20:33:22.5120660Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:22.5133867Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:22.5146502Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:22.5159078Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:22.5172586Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:22.5188936Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:22.5201638Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:22.5214694Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:22.5227311Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:22.5240057Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:22.5253094Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5265113Z 2025-05-07T20:33:22.5265773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5265779Z 2025-05-07T20:33:22.5265887Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5266112Z self=, 2025-05-07T20:33:22.5266191Z T=2048, 2025-05-07T20:33:22.5266267Z D=7168, 2025-05-07T20:33:22.5266353Z scale_ub=None, 2025-05-07T20:33:22.5266438Z contiguous=False, 2025-05-07T20:33:22.5266521Z compiled=True, 2025-05-07T20:33:22.5266596Z ) 2025-05-07T20:33:22.5266813Z self = 2025-05-07T20:33:22.5266992Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5267000Z 2025-05-07T20:33:22.5267080Z @given( 2025-05-07T20:33:22.5267197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5267301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5267414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5267529Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5267644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5267718Z ) 2025-05-07T20:33:22.5267960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5268057Z def test_silu_mul_quant( 2025-05-07T20:33:22.5268134Z self, 2025-05-07T20:33:22.5268210Z T: int, 2025-05-07T20:33:22.5268289Z D: int, 2025-05-07T20:33:22.5268387Z scale_ub: Optional[float], 2025-05-07T20:33:22.5268476Z contiguous: bool, 2025-05-07T20:33:22.5268718Z compiled: bool, 2025-05-07T20:33:22.5268798Z ) -> None: 2025-05-07T20:33:22.5268897Z torch.manual_seed(2025) 2025-05-07T20:33:22.5268969Z 2025-05-07T20:33:22.5269140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5269275Z 2025-05-07T20:33:22.5269366Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5269490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5269580Z x = x_sign * x_clamp 2025-05-07T20:33:22.5269660Z x0 = x[:, :D] 2025-05-07T20:33:22.5269739Z x1 = x[:, D:] 2025-05-07T20:33:22.5269813Z 2025-05-07T20:33:22.5269895Z if contiguous: 2025-05-07T20:33:22.5269986Z x0 = x0.contiguous() 2025-05-07T20:33:22.5270076Z x1 = x1.contiguous() 2025-05-07T20:33:22.5270148Z 2025-05-07T20:33:22.5270243Z if scale_ub is not None: 2025-05-07T20:33:22.5270348Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5270483Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5270568Z ) 2025-05-07T20:33:22.5270698Z else: 2025-05-07T20:33:22.5270793Z scale_ub_tensor = None 2025-05-07T20:33:22.5270871Z 2025-05-07T20:33:22.5270999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5271090Z op = silu_mul_quant 2025-05-07T20:33:22.5271179Z if compiled: 2025-05-07T20:33:22.5271276Z op = torch.compile(op) 2025-05-07T20:33:22.5271380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5271456Z 2025-05-07T20:33:22.5271546Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5271551Z 2025-05-07T20:33:22.5271649Z moe/activation_test.py:117: 2025-05-07T20:33:22.5271777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5271877Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5271977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5272351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5272445Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5272938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5273036Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5273394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5273615Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5273951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5274047Z kernel = self.compile( 2025-05-07T20:33:22.5274425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5274601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5274734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5274741Z 2025-05-07T20:33:22.5274944Z self = 2025-05-07T20:33:22.5275775Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5276277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf558720>} 2025-05-07T20:33:22.5277070Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5277325Z context = 2025-05-07T20:33:22.5277329Z 2025-05-07T20:33:22.5277494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5277799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5277906Z module_map=module_map) 2025-05-07T20:33:22.5278069Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5278170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5278247Z E ^ 2025-05-07T20:33:22.5278602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5278606Z 2025-05-07T20:33:22.5279014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5279018Z 2025-05-07T20:33:22.5279126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5279388Z self=, 2025-05-07T20:33:22.5279468Z T=4096, 2025-05-07T20:33:22.5279552Z D=7168, 2025-05-07T20:33:22.5279636Z scale_ub=None, 2025-05-07T20:33:22.5279724Z contiguous=False, 2025-05-07T20:33:22.5279812Z compiled=True, 2025-05-07T20:33:22.5279886Z ) 2025-05-07T20:33:22.5280102Z self = 2025-05-07T20:33:22.5280278Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5280282Z 2025-05-07T20:33:22.5280360Z @given( 2025-05-07T20:33:22.5280480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5280582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5280698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5280818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5280938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5281014Z ) 2025-05-07T20:33:22.5281267Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5281362Z def test_silu_mul_quant( 2025-05-07T20:33:22.5281440Z self, 2025-05-07T20:33:22.5281521Z T: int, 2025-05-07T20:33:22.5281599Z D: int, 2025-05-07T20:33:22.5281697Z scale_ub: Optional[float], 2025-05-07T20:33:22.5281787Z contiguous: bool, 2025-05-07T20:33:22.5281871Z compiled: bool, 2025-05-07T20:33:22.5281947Z ) -> None: 2025-05-07T20:33:22.5282044Z torch.manual_seed(2025) 2025-05-07T20:33:22.5282116Z 2025-05-07T20:33:22.5282285Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5282359Z 2025-05-07T20:33:22.5282449Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5282575Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5282669Z x = x_sign * x_clamp 2025-05-07T20:33:22.5282748Z x0 = x[:, :D] 2025-05-07T20:33:22.5282835Z x1 = x[:, D:] 2025-05-07T20:33:22.5282907Z 2025-05-07T20:33:22.5282991Z if contiguous: 2025-05-07T20:33:22.5283088Z x0 = x0.contiguous() 2025-05-07T20:33:22.5283175Z x1 = x1.contiguous() 2025-05-07T20:33:22.5283246Z 2025-05-07T20:33:22.5283339Z if scale_ub is not None: 2025-05-07T20:33:22.5283442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5283578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5283653Z ) 2025-05-07T20:33:22.5283730Z else: 2025-05-07T20:33:22.5283828Z scale_ub_tensor = None 2025-05-07T20:33:22.5283900Z 2025-05-07T20:33:22.5284028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5284122Z op = silu_mul_quant 2025-05-07T20:33:22.5284207Z if compiled: 2025-05-07T20:33:22.5284353Z op = torch.compile(op) 2025-05-07T20:33:22.5284504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5284577Z 2025-05-07T20:33:22.5284669Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5284711Z 2025-05-07T20:33:22.5284810Z moe/activation_test.py:117: 2025-05-07T20:33:22.5284937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5285043Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5285141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5285505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5285599Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5286087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5286183Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5286543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5286804Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5287147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5287240Z kernel = self.compile( 2025-05-07T20:33:22.5287620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5287795Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5287923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5287927Z 2025-05-07T20:33:22.5288134Z self = 2025-05-07T20:33:22.5288913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5289415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf559440>} 2025-05-07T20:33:22.5290161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5290349Z context = 2025-05-07T20:33:22.5290354Z 2025-05-07T20:33:22.5290521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5290783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5290890Z module_map=module_map) 2025-05-07T20:33:22.5291057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5291160Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5291239Z E ^ 2025-05-07T20:33:22.5291593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5291599Z 2025-05-07T20:33:22.5292007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5292012Z 2025-05-07T20:33:22.5292118Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5292341Z self=, 2025-05-07T20:33:22.5292424Z T=16384, 2025-05-07T20:33:22.5292502Z D=5120, 2025-05-07T20:33:22.5292585Z scale_ub=1200.0, 2025-05-07T20:33:22.5292675Z contiguous=False, 2025-05-07T20:33:22.5292760Z compiled=False, 2025-05-07T20:33:22.5292834Z ) 2025-05-07T20:33:22.5293097Z self = 2025-05-07T20:33:22.5293317Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5293321Z 2025-05-07T20:33:22.5293403Z @given( 2025-05-07T20:33:22.5293566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5293666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5293779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5293899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5294011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5294087Z ) 2025-05-07T20:33:22.5294329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5294420Z def test_silu_mul_quant( 2025-05-07T20:33:22.5294498Z self, 2025-05-07T20:33:22.5294574Z T: int, 2025-05-07T20:33:22.5294650Z D: int, 2025-05-07T20:33:22.5294749Z scale_ub: Optional[float], 2025-05-07T20:33:22.5294843Z contiguous: bool, 2025-05-07T20:33:22.5294966Z compiled: bool, 2025-05-07T20:33:22.5295049Z ) -> None: 2025-05-07T20:33:22.5295141Z torch.manual_seed(2025) 2025-05-07T20:33:22.5295218Z 2025-05-07T20:33:22.5295393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5295467Z 2025-05-07T20:33:22.5295560Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5295684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5295772Z x = x_sign * x_clamp 2025-05-07T20:33:22.5295854Z x0 = x[:, :D] 2025-05-07T20:33:22.5295933Z x1 = x[:, D:] 2025-05-07T20:33:22.5296004Z 2025-05-07T20:33:22.5296090Z if contiguous: 2025-05-07T20:33:22.5296181Z x0 = x0.contiguous() 2025-05-07T20:33:22.5296272Z x1 = x1.contiguous() 2025-05-07T20:33:22.5296346Z 2025-05-07T20:33:22.5296438Z if scale_ub is not None: 2025-05-07T20:33:22.5296546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5296685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5296762Z ) 2025-05-07T20:33:22.5296840Z else: 2025-05-07T20:33:22.5296937Z scale_ub_tensor = None 2025-05-07T20:33:22.5297010Z 2025-05-07T20:33:22.5297140Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5297229Z op = silu_mul_quant 2025-05-07T20:33:22.5297314Z if compiled: 2025-05-07T20:33:22.5297415Z op = torch.compile(op) 2025-05-07T20:33:22.5297520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5300738Z 2025-05-07T20:33:22.5300844Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5300849Z 2025-05-07T20:33:22.5300952Z moe/activation_test.py:117: 2025-05-07T20:33:22.5301081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5301182Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5301294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5301796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:22.5301896Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5302259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5302479Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5302822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5302916Z kernel = self.compile( 2025-05-07T20:33:22.5303297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5303476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5303668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5303710Z 2025-05-07T20:33:22.5303921Z self = 2025-05-07T20:33:22.5304734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5305237Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55a340>} 2025-05-07T20:33:22.5305982Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5306171Z context = 2025-05-07T20:33:22.5306181Z 2025-05-07T20:33:22.5306387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5306650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5306761Z module_map=module_map) 2025-05-07T20:33:22.5306924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5307023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5307104Z E ^ 2025-05-07T20:33:22.5307456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5307461Z 2025-05-07T20:33:22.5307872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5307876Z 2025-05-07T20:33:22.5307981Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5308205Z self=, 2025-05-07T20:33:22.5308290Z T=16384, 2025-05-07T20:33:22.5308369Z D=5120, 2025-05-07T20:33:22.5308451Z scale_ub=1200.0, 2025-05-07T20:33:22.5308543Z contiguous=True, 2025-05-07T20:33:22.5308631Z compiled=True, 2025-05-07T20:33:22.5308704Z ) 2025-05-07T20:33:22.5308924Z self = 2025-05-07T20:33:22.5309098Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5309102Z 2025-05-07T20:33:22.5309180Z @given( 2025-05-07T20:33:22.5309304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5309402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5309515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5309635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5309748Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5309824Z ) 2025-05-07T20:33:22.5310074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5310173Z def test_silu_mul_quant( 2025-05-07T20:33:22.5310253Z self, 2025-05-07T20:33:22.5310333Z T: int, 2025-05-07T20:33:22.5310413Z D: int, 2025-05-07T20:33:22.5310518Z scale_ub: Optional[float], 2025-05-07T20:33:22.5310606Z contiguous: bool, 2025-05-07T20:33:22.5310690Z compiled: bool, 2025-05-07T20:33:22.5310773Z ) -> None: 2025-05-07T20:33:22.5310869Z torch.manual_seed(2025) 2025-05-07T20:33:22.5310942Z 2025-05-07T20:33:22.5311112Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5311187Z 2025-05-07T20:33:22.5311284Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5311407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5311495Z x = x_sign * x_clamp 2025-05-07T20:33:22.5311578Z x0 = x[:, :D] 2025-05-07T20:33:22.5311656Z x1 = x[:, D:] 2025-05-07T20:33:22.5311839Z 2025-05-07T20:33:22.5311928Z if contiguous: 2025-05-07T20:33:22.5312022Z x0 = x0.contiguous() 2025-05-07T20:33:22.5312109Z x1 = x1.contiguous() 2025-05-07T20:33:22.5312230Z 2025-05-07T20:33:22.5312321Z if scale_ub is not None: 2025-05-07T20:33:22.5312425Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5312560Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5312634Z ) 2025-05-07T20:33:22.5312717Z else: 2025-05-07T20:33:22.5312810Z scale_ub_tensor = None 2025-05-07T20:33:22.5312882Z 2025-05-07T20:33:22.5313013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5313102Z op = silu_mul_quant 2025-05-07T20:33:22.5313186Z if compiled: 2025-05-07T20:33:22.5313288Z op = torch.compile(op) 2025-05-07T20:33:22.5313394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5313473Z 2025-05-07T20:33:22.5313567Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5313611Z 2025-05-07T20:33:22.5313709Z moe/activation_test.py:117: 2025-05-07T20:33:22.5313839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5313943Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5314041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5314409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5314501Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5314989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5315089Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5315444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5315673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5316092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5316190Z kernel = self.compile( 2025-05-07T20:33:22.5316572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5316745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5316871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5316876Z 2025-05-07T20:33:22.5317086Z self = 2025-05-07T20:33:22.5317861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5318373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf55b9c0>} 2025-05-07T20:33:22.5319118Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5319310Z context = 2025-05-07T20:33:22.5319315Z 2025-05-07T20:33:22.5319477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5319739Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5319849Z module_map=module_map) 2025-05-07T20:33:22.5320008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5320107Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5320835Z E ^ 2025-05-07T20:33:22.5321195Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5321239Z 2025-05-07T20:33:22.5321656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5321661Z 2025-05-07T20:33:22.5321763Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5321983Z self=, 2025-05-07T20:33:22.5322064Z T=16384, 2025-05-07T20:33:22.5322140Z D=5120, 2025-05-07T20:33:22.5322223Z scale_ub=None, 2025-05-07T20:33:22.5322313Z contiguous=False, 2025-05-07T20:33:22.5322394Z compiled=True, 2025-05-07T20:33:22.5322472Z ) 2025-05-07T20:33:22.5322688Z self = 2025-05-07T20:33:22.5322867Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5322874Z 2025-05-07T20:33:22.5322960Z @given( 2025-05-07T20:33:22.5323118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5323218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5323338Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5323454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5323571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5323644Z ) 2025-05-07T20:33:22.5323887Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5323981Z def test_silu_mul_quant( 2025-05-07T20:33:22.5324058Z self, 2025-05-07T20:33:22.5324138Z T: int, 2025-05-07T20:33:22.5324219Z D: int, 2025-05-07T20:33:22.5324316Z scale_ub: Optional[float], 2025-05-07T20:33:22.5324405Z contiguous: bool, 2025-05-07T20:33:22.5324494Z compiled: bool, 2025-05-07T20:33:22.5324576Z ) -> None: 2025-05-07T20:33:22.5324670Z torch.manual_seed(2025) 2025-05-07T20:33:22.5324748Z 2025-05-07T20:33:22.5324917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5324993Z 2025-05-07T20:33:22.5325088Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5325211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5325305Z x = x_sign * x_clamp 2025-05-07T20:33:22.5325385Z x0 = x[:, :D] 2025-05-07T20:33:22.5325463Z x1 = x[:, D:] 2025-05-07T20:33:22.5325540Z 2025-05-07T20:33:22.5325622Z if contiguous: 2025-05-07T20:33:22.5325713Z x0 = x0.contiguous() 2025-05-07T20:33:22.5325805Z x1 = x1.contiguous() 2025-05-07T20:33:22.5325878Z 2025-05-07T20:33:22.5325970Z if scale_ub is not None: 2025-05-07T20:33:22.5326078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5326209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5326288Z ) 2025-05-07T20:33:22.5326365Z else: 2025-05-07T20:33:22.5326466Z scale_ub_tensor = None 2025-05-07T20:33:22.5326541Z 2025-05-07T20:33:22.5326670Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5326763Z op = silu_mul_quant 2025-05-07T20:33:22.5326849Z if compiled: 2025-05-07T20:33:22.5326947Z op = torch.compile(op) 2025-05-07T20:33:22.5327053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5327130Z 2025-05-07T20:33:22.5327219Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5327224Z 2025-05-07T20:33:22.5327319Z moe/activation_test.py:117: 2025-05-07T20:33:22.5327450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5327549Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5327650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5328062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5328193Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5328684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5328820Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5329174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5329402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5329740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5329839Z kernel = self.compile( 2025-05-07T20:33:22.5330218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5330393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5330563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5330568Z 2025-05-07T20:33:22.5330773Z self = 2025-05-07T20:33:22.5331555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5332059Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf368c20>} 2025-05-07T20:33:22.5332804Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5332999Z context = 2025-05-07T20:33:22.5333006Z 2025-05-07T20:33:22.5333172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5333443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5333549Z module_map=module_map) 2025-05-07T20:33:22.5333709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5333811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5333886Z E ^ 2025-05-07T20:33:22.5334240Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5334250Z 2025-05-07T20:33:22.5334658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5334662Z 2025-05-07T20:33:22.5334764Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5334995Z self=, 2025-05-07T20:33:22.5335075Z T=2048, 2025-05-07T20:33:22.5335152Z D=5120, 2025-05-07T20:33:22.5335238Z scale_ub=None, 2025-05-07T20:33:22.5335329Z contiguous=False, 2025-05-07T20:33:22.5335410Z compiled=True, 2025-05-07T20:33:22.5335484Z ) 2025-05-07T20:33:22.5335702Z self = 2025-05-07T20:33:22.5335876Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5335881Z 2025-05-07T20:33:22.5335959Z @given( 2025-05-07T20:33:22.5336077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5336179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5336294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5336410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5336526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5336683Z ) 2025-05-07T20:33:22.5336931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5337024Z def test_silu_mul_quant( 2025-05-07T20:33:22.5337139Z self, 2025-05-07T20:33:22.5337225Z T: int, 2025-05-07T20:33:22.5337301Z D: int, 2025-05-07T20:33:22.5337398Z scale_ub: Optional[float], 2025-05-07T20:33:22.5337488Z contiguous: bool, 2025-05-07T20:33:22.5337573Z compiled: bool, 2025-05-07T20:33:22.5337651Z ) -> None: 2025-05-07T20:33:22.5337749Z torch.manual_seed(2025) 2025-05-07T20:33:22.5337822Z 2025-05-07T20:33:22.5337988Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5338066Z 2025-05-07T20:33:22.5338156Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5338279Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5338373Z x = x_sign * x_clamp 2025-05-07T20:33:22.5338458Z x0 = x[:, :D] 2025-05-07T20:33:22.5338541Z x1 = x[:, D:] 2025-05-07T20:33:22.5338681Z 2025-05-07T20:33:22.5338767Z if contiguous: 2025-05-07T20:33:22.5338861Z x0 = x0.contiguous() 2025-05-07T20:33:22.5338953Z x1 = x1.contiguous() 2025-05-07T20:33:22.5339025Z 2025-05-07T20:33:22.5339117Z if scale_ub is not None: 2025-05-07T20:33:22.5339226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5339359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5339435Z ) 2025-05-07T20:33:22.5339514Z else: 2025-05-07T20:33:22.5339609Z scale_ub_tensor = None 2025-05-07T20:33:22.5339681Z 2025-05-07T20:33:22.5339814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5339905Z op = silu_mul_quant 2025-05-07T20:33:22.5339993Z if compiled: 2025-05-07T20:33:22.5340093Z op = torch.compile(op) 2025-05-07T20:33:22.5340202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5340280Z 2025-05-07T20:33:22.5340372Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5340376Z 2025-05-07T20:33:22.5340473Z moe/activation_test.py:117: 2025-05-07T20:33:22.5340608Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5340708Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5340805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5341172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5341268Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5341761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5341857Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5342215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5342443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5342779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5342874Z kernel = self.compile( 2025-05-07T20:33:22.5343256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5343429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5343558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5343563Z 2025-05-07T20:33:22.5343764Z self = 2025-05-07T20:33:22.5344586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5345132Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf3699e0>} 2025-05-07T20:33:22.5345915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5346108Z context = 2025-05-07T20:33:22.5346112Z 2025-05-07T20:33:22.5346274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5346543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5346650Z module_map=module_map) 2025-05-07T20:33:22.5346809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5346916Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5347031Z E ^ 2025-05-07T20:33:22.5347386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5347394Z 2025-05-07T20:33:22.5347809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5347814Z 2025-05-07T20:33:22.5347915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5348138Z self=, 2025-05-07T20:33:22.5348214Z T=2048, 2025-05-07T20:33:22.5348290Z D=5120, 2025-05-07T20:33:22.5348374Z scale_ub=1200.0, 2025-05-07T20:33:22.5348460Z contiguous=False, 2025-05-07T20:33:22.5348541Z compiled=True, 2025-05-07T20:33:22.5348616Z ) 2025-05-07T20:33:22.5348833Z self = 2025-05-07T20:33:22.5349012Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5349021Z 2025-05-07T20:33:22.5349097Z @given( 2025-05-07T20:33:22.5349215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5349318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5349433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5349549Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5349666Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5349740Z ) 2025-05-07T20:33:22.5349981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5350078Z def test_silu_mul_quant( 2025-05-07T20:33:22.5350154Z self, 2025-05-07T20:33:22.5350231Z T: int, 2025-05-07T20:33:22.5350309Z D: int, 2025-05-07T20:33:22.5350406Z scale_ub: Optional[float], 2025-05-07T20:33:22.5350497Z contiguous: bool, 2025-05-07T20:33:22.5350588Z compiled: bool, 2025-05-07T20:33:22.5350666Z ) -> None: 2025-05-07T20:33:22.5350765Z torch.manual_seed(2025) 2025-05-07T20:33:22.5350837Z 2025-05-07T20:33:22.5351005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5351085Z 2025-05-07T20:33:22.5351177Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5351300Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5351391Z x = x_sign * x_clamp 2025-05-07T20:33:22.5351471Z x0 = x[:, :D] 2025-05-07T20:33:22.5351549Z x1 = x[:, D:] 2025-05-07T20:33:22.5351625Z 2025-05-07T20:33:22.5351707Z if contiguous: 2025-05-07T20:33:22.5351799Z x0 = x0.contiguous() 2025-05-07T20:33:22.5351887Z x1 = x1.contiguous() 2025-05-07T20:33:22.5351959Z 2025-05-07T20:33:22.5352053Z if scale_ub is not None: 2025-05-07T20:33:22.5352159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5352377Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5352459Z ) 2025-05-07T20:33:22.5352536Z else: 2025-05-07T20:33:22.5352630Z scale_ub_tensor = None 2025-05-07T20:33:22.5352747Z 2025-05-07T20:33:22.5352875Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5352966Z op = silu_mul_quant 2025-05-07T20:33:22.5353056Z if compiled: 2025-05-07T20:33:22.5353154Z op = torch.compile(op) 2025-05-07T20:33:22.5353261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5353334Z 2025-05-07T20:33:22.5353423Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5353427Z 2025-05-07T20:33:22.5353526Z moe/activation_test.py:117: 2025-05-07T20:33:22.5353654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5353754Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5353858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5354269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5354364Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5354859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5354955Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5355313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5355532Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5355917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5356014Z kernel = self.compile( 2025-05-07T20:33:22.5356391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5356573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5356705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5356712Z 2025-05-07T20:33:22.5356914Z self = 2025-05-07T20:33:22.5357691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5358193Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf36ab60>} 2025-05-07T20:33:22.5358944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5359136Z context = 2025-05-07T20:33:22.5359140Z 2025-05-07T20:33:22.5359303Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5359570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5359675Z module_map=module_map) 2025-05-07T20:33:22.5359837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5359936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5360012Z E ^ 2025-05-07T20:33:22.5360369Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5360374Z 2025-05-07T20:33:22.5360784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5360788Z 2025-05-07T20:33:22.5360980Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5361206Z self=, 2025-05-07T20:33:22.5361285Z T=4096, 2025-05-07T20:33:22.5361404Z D=5120, 2025-05-07T20:33:22.5361487Z scale_ub=1200.0, 2025-05-07T20:33:22.5361571Z contiguous=True, 2025-05-07T20:33:22.5361656Z compiled=True, 2025-05-07T20:33:22.5361729Z ) 2025-05-07T20:33:22.5361947Z self = 2025-05-07T20:33:22.5362119Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5362124Z 2025-05-07T20:33:22.5362202Z @given( 2025-05-07T20:33:22.5362324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5362422Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5362537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5362657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5362774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5362892Z ) 2025-05-07T20:33:22.5363137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5363235Z def test_silu_mul_quant( 2025-05-07T20:33:22.5363311Z self, 2025-05-07T20:33:22.5363393Z T: int, 2025-05-07T20:33:22.5363470Z D: int, 2025-05-07T20:33:22.5363566Z scale_ub: Optional[float], 2025-05-07T20:33:22.5363657Z contiguous: bool, 2025-05-07T20:33:22.5363742Z compiled: bool, 2025-05-07T20:33:22.5363820Z ) -> None: 2025-05-07T20:33:22.5363917Z torch.manual_seed(2025) 2025-05-07T20:33:22.5363991Z 2025-05-07T20:33:22.5364158Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5364234Z 2025-05-07T20:33:22.5364325Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5364451Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5364547Z x = x_sign * x_clamp 2025-05-07T20:33:22.5364626Z x0 = x[:, :D] 2025-05-07T20:33:22.5364712Z x1 = x[:, D:] 2025-05-07T20:33:22.5364785Z 2025-05-07T20:33:22.5364868Z if contiguous: 2025-05-07T20:33:22.5364963Z x0 = x0.contiguous() 2025-05-07T20:33:22.5365052Z x1 = x1.contiguous() 2025-05-07T20:33:22.5365124Z 2025-05-07T20:33:22.5365219Z if scale_ub is not None: 2025-05-07T20:33:22.5365325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5365691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5365810Z ) 2025-05-07T20:33:22.5365899Z else: 2025-05-07T20:33:22.5366000Z scale_ub_tensor = None 2025-05-07T20:33:22.5366073Z 2025-05-07T20:33:22.5366203Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5366295Z op = silu_mul_quant 2025-05-07T20:33:22.5366380Z if compiled: 2025-05-07T20:33:22.5366485Z op = torch.compile(op) 2025-05-07T20:33:22.5366594Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5366666Z 2025-05-07T20:33:22.5366757Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5366763Z 2025-05-07T20:33:22.5366863Z moe/activation_test.py:117: 2025-05-07T20:33:22.5366992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5367099Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5367197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5367561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5367658Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5368148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5368246Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5368696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5368973Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5369406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5369502Z kernel = self.compile( 2025-05-07T20:33:22.5369881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5370057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5370185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5370189Z 2025-05-07T20:33:22.5370396Z self = 2025-05-07T20:33:22.5371232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5371740Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf4b8180>} 2025-05-07T20:33:22.5372490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5372678Z context = 2025-05-07T20:33:22.5372683Z 2025-05-07T20:33:22.5372849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5373112Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5373218Z module_map=module_map) 2025-05-07T20:33:22.5373386Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5373488Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5373566Z E ^ 2025-05-07T20:33:22.5373922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5373929Z 2025-05-07T20:33:22.5374338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5374342Z 2025-05-07T20:33:22.5374448Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5374670Z self=, 2025-05-07T20:33:22.5374750Z T=128, 2025-05-07T20:33:22.5374832Z D=5120, 2025-05-07T20:33:22.5374916Z scale_ub=1200.0, 2025-05-07T20:33:22.5375001Z contiguous=False, 2025-05-07T20:33:22.5375087Z compiled=True, 2025-05-07T20:33:22.5375160Z ) 2025-05-07T20:33:22.5375381Z self = 2025-05-07T20:33:22.5375559Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:22.5375563Z 2025-05-07T20:33:22.5375641Z @given( 2025-05-07T20:33:22.5375764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5375862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5375974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5376094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5376207Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5376284Z ) 2025-05-07T20:33:22.5376524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5376618Z def test_silu_mul_quant( 2025-05-07T20:33:22.5376700Z self, 2025-05-07T20:33:22.5376776Z T: int, 2025-05-07T20:33:22.5376852Z D: int, 2025-05-07T20:33:22.5376953Z scale_ub: Optional[float], 2025-05-07T20:33:22.5377126Z contiguous: bool, 2025-05-07T20:33:22.5377215Z compiled: bool, 2025-05-07T20:33:22.5377298Z ) -> None: 2025-05-07T20:33:22.5377393Z torch.manual_seed(2025) 2025-05-07T20:33:22.5377505Z 2025-05-07T20:33:22.5377679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5377754Z 2025-05-07T20:33:22.5377846Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5377972Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5378061Z x = x_sign * x_clamp 2025-05-07T20:33:22.5378144Z x0 = x[:, :D] 2025-05-07T20:33:22.5378224Z x1 = x[:, D:] 2025-05-07T20:33:22.5378296Z 2025-05-07T20:33:22.5378382Z if contiguous: 2025-05-07T20:33:22.5378472Z x0 = x0.contiguous() 2025-05-07T20:33:22.5378561Z x1 = x1.contiguous() 2025-05-07T20:33:22.5378636Z 2025-05-07T20:33:22.5378726Z if scale_ub is not None: 2025-05-07T20:33:22.5378836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5379017Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5379094Z ) 2025-05-07T20:33:22.5379170Z else: 2025-05-07T20:33:22.5379269Z scale_ub_tensor = None 2025-05-07T20:33:22.5379342Z 2025-05-07T20:33:22.5379475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5379566Z op = silu_mul_quant 2025-05-07T20:33:22.5379664Z if compiled: 2025-05-07T20:33:22.5379780Z op = torch.compile(op) 2025-05-07T20:33:22.5379906Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5379983Z 2025-05-07T20:33:22.5380076Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5380081Z 2025-05-07T20:33:22.5380176Z moe/activation_test.py:117: 2025-05-07T20:33:22.5380305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5380409Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5380514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5380884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5380980Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
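Every fp8 example below fails the same way: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which Triton supports only on GPUs of compute capability 8.9 or newer (Ada/Hopper class), while the A10G on this g5.4xlarge runner is SM 8.6, so only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard that would skip these cases instead of failing them; the helper and decorator names are illustrative, not FBGEMM's actual test code:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper);
        # older parts expose only fp8e4b15 and fp8e5, per the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class or method, this would skip rather than fail
    # on pre-SM-8.9 GPUs such as this runner's A10G (SM 8.6).
    skip_if_no_fp8e4nv = unittest.skipUnless(
        _supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )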
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported ...")
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. ...
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. ...
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. ...
moe/activation_test.py:94: OutOfMemoryError
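The requested allocation sizes line up exactly with the first tensor the test creates at moe/activation_test.py:92, x = torch.randn([T, 2 * D], dtype=torch.bfloat16), at 2 bytes per bfloat16 element. A quick arithmetic check:

    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) takes T * 2D * 2 bytes.
    for T, D in [(16384, 5120), (4096, 7168), (16384, 7168), (2048, 7168)]:
        print(f"T={T:>5}, D={D}: {T * 2 * D * 2 / 2**20:.2f} MiB")
    # Prints 320.00, 112.00, 448.00 and 56.00 MiB, exactly the sizes the
    # OutOfMemoryError messages above report.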
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5496178Z 2025-05-07T20:33:22.5496331Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:22.5496337Z 2025-05-07T20:33:22.5496445Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5496666Z self=, 2025-05-07T20:33:22.5496744Z T=1, 2025-05-07T20:33:22.5496827Z D=7168, 2025-05-07T20:33:22.5496910Z scale_ub=1200.0, 2025-05-07T20:33:22.5496995Z contiguous=True, 2025-05-07T20:33:22.5497084Z compiled=False, 2025-05-07T20:33:22.5497158Z ) 2025-05-07T20:33:22.5497373Z self = 2025-05-07T20:33:22.5497541Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5497546Z 2025-05-07T20:33:22.5497624Z @given( 2025-05-07T20:33:22.5497747Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5497849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5497968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5498088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5498201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5498274Z ) 2025-05-07T20:33:22.5498517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5498609Z def test_silu_mul_quant( 2025-05-07T20:33:22.5498683Z self, 2025-05-07T20:33:22.5498763Z T: int, 2025-05-07T20:33:22.5498839Z D: int, 2025-05-07T20:33:22.5498938Z scale_ub: Optional[float], 2025-05-07T20:33:22.5499025Z contiguous: bool, 2025-05-07T20:33:22.5499110Z compiled: bool, 2025-05-07T20:33:22.5499189Z ) -> None: 2025-05-07T20:33:22.5499281Z torch.manual_seed(2025) 2025-05-07T20:33:22.5499354Z 2025-05-07T20:33:22.5499523Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5499602Z 2025-05-07T20:33:22.5499692Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5499821Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5499910Z x = x_sign * x_clamp 2025-05-07T20:33:22.5499992Z x0 = x[:, :D] 2025-05-07T20:33:22.5500076Z x1 = x[:, D:] 2025-05-07T20:33:22.5500150Z 2025-05-07T20:33:22.5500241Z if contiguous: 2025-05-07T20:33:22.5500334Z x0 = x0.contiguous() 2025-05-07T20:33:22.5500422Z x1 = x1.contiguous() 2025-05-07T20:33:22.5500498Z 2025-05-07T20:33:22.5500587Z if scale_ub is not None: 2025-05-07T20:33:22.5500692Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5500830Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5500906Z ) 2025-05-07T20:33:22.5500982Z else: 2025-05-07T20:33:22.5501079Z scale_ub_tensor = None 2025-05-07T20:33:22.5501151Z 2025-05-07T20:33:22.5501363Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5501460Z op = silu_mul_quant 2025-05-07T20:33:22.5501545Z if compiled: 2025-05-07T20:33:22.5501642Z op = torch.compile(op) 2025-05-07T20:33:22.5501790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5501861Z 2025-05-07T20:33:22.5501955Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5501960Z 2025-05-07T20:33:22.5502055Z moe/activation_test.py:117: 2025-05-07T20:33:22.5502183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5502288Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5502388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5502886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5502986Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5503346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5503610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5503952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5504044Z kernel = self.compile( 2025-05-07T20:33:22.5504427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5504598Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5504728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5504733Z 2025-05-07T20:33:22.5504936Z self = 2025-05-07T20:33:22.5505721Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5506229Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0aa520>} 2025-05-07T20:33:22.5506974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5507167Z context = 2025-05-07T20:33:22.5507171Z 2025-05-07T20:33:22.5507335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5507598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5507708Z module_map=module_map) 2025-05-07T20:33:22.5507875Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5507981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5508059Z E ^ 2025-05-07T20:33:22.5508413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5508420Z 2025-05-07T20:33:22.5508833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5508838Z 2025-05-07T20:33:22.5508944Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5509169Z self=, 2025-05-07T20:33:22.5509247Z T=128, 2025-05-07T20:33:22.5509326Z D=5120, 2025-05-07T20:33:22.5509412Z scale_ub=None, 2025-05-07T20:33:22.5509497Z contiguous=True, 2025-05-07T20:33:22.5509581Z compiled=False, 2025-05-07T20:33:22.5509663Z ) 2025-05-07T20:33:22.5509967Z self = 2025-05-07T20:33:22.5510181Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5510186Z 2025-05-07T20:33:22.5510268Z @given( 2025-05-07T20:33:22.5510425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5510524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5510643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5510760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5510876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5510955Z ) 2025-05-07T20:33:22.5511197Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5511294Z def test_silu_mul_quant( 2025-05-07T20:33:22.5511371Z self, 2025-05-07T20:33:22.5511451Z T: int, 2025-05-07T20:33:22.5511532Z D: int, 2025-05-07T20:33:22.5511632Z scale_ub: Optional[float], 2025-05-07T20:33:22.5511728Z contiguous: bool, 2025-05-07T20:33:22.5511865Z compiled: bool, 2025-05-07T20:33:22.5511946Z ) -> None: 2025-05-07T20:33:22.5512041Z torch.manual_seed(2025) 2025-05-07T20:33:22.5512120Z 2025-05-07T20:33:22.5512287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5512362Z 2025-05-07T20:33:22.5512452Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5512575Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5512665Z x = x_sign * x_clamp 2025-05-07T20:33:22.5512745Z x0 = x[:, :D] 2025-05-07T20:33:22.5512824Z x1 = x[:, D:] 2025-05-07T20:33:22.5512900Z 2025-05-07T20:33:22.5512982Z if contiguous: 2025-05-07T20:33:22.5513074Z x0 = x0.contiguous() 2025-05-07T20:33:22.5513166Z x1 = x1.contiguous() 2025-05-07T20:33:22.5513239Z 2025-05-07T20:33:22.5513329Z if scale_ub is not None: 2025-05-07T20:33:22.5513438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5513577Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5513661Z ) 2025-05-07T20:33:22.5513736Z else: 2025-05-07T20:33:22.5513832Z scale_ub_tensor = None 2025-05-07T20:33:22.5513908Z 2025-05-07T20:33:22.5514035Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5514126Z op = silu_mul_quant 2025-05-07T20:33:22.5514212Z if compiled: 2025-05-07T20:33:22.5514308Z op = torch.compile(op) 2025-05-07T20:33:22.5514411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5514486Z 2025-05-07T20:33:22.5514575Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5514579Z 2025-05-07T20:33:22.5514674Z moe/activation_test.py:117: 2025-05-07T20:33:22.5514806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5514906Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5515013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5515508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5515608Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5516044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5516264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5516606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5516700Z kernel = self.compile( 2025-05-07T20:33:22.5517078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5517253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5517427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5517466Z 2025-05-07T20:33:22.5517673Z self = 2025-05-07T20:33:22.5518493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5518994Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cf0ab420>} 2025-05-07T20:33:22.5519748Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5519972Z context = 2025-05-07T20:33:22.5519987Z 2025-05-07T20:33:22.5520197Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5520460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5520570Z module_map=module_map) 2025-05-07T20:33:22.5520733Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5520832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5520910Z E ^ 2025-05-07T20:33:22.5521267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5521272Z 2025-05-07T20:33:22.5521680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5521685Z 2025-05-07T20:33:22.5521789Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5522012Z self=, 2025-05-07T20:33:22.5522091Z T=128, 2025-05-07T20:33:22.5522172Z D=7168, 2025-05-07T20:33:22.5522254Z scale_ub=None, 2025-05-07T20:33:22.5522338Z contiguous=True, 2025-05-07T20:33:22.5522427Z compiled=False, 2025-05-07T20:33:22.5522499Z ) 2025-05-07T20:33:22.5522717Z self = 2025-05-07T20:33:22.5522886Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5522891Z 2025-05-07T20:33:22.5522967Z @given( 2025-05-07T20:33:22.5523087Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5523184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5523298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5523418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5523529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5523602Z ) 2025-05-07T20:33:22.5523852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5523948Z def test_silu_mul_quant( 2025-05-07T20:33:22.5524026Z self, 2025-05-07T20:33:22.5524101Z T: int, 2025-05-07T20:33:22.5524179Z D: int, 2025-05-07T20:33:22.5524279Z scale_ub: Optional[float], 2025-05-07T20:33:22.5524370Z contiguous: bool, 2025-05-07T20:33:22.5524455Z compiled: bool, 2025-05-07T20:33:22.5524535Z ) -> None: 2025-05-07T20:33:22.5524629Z torch.manual_seed(2025) 2025-05-07T20:33:22.5524702Z 2025-05-07T20:33:22.5524872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5524945Z 2025-05-07T20:33:22.5525036Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5525163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5525250Z x = x_sign * x_clamp 2025-05-07T20:33:22.5525333Z x0 = x[:, :D] 2025-05-07T20:33:22.5525412Z x1 = x[:, D:] 2025-05-07T20:33:22.5525593Z 2025-05-07T20:33:22.5525678Z if contiguous: 2025-05-07T20:33:22.5525771Z x0 = x0.contiguous() 2025-05-07T20:33:22.5525859Z x1 = x1.contiguous() 2025-05-07T20:33:22.5525975Z 2025-05-07T20:33:22.5526067Z if scale_ub is not None: 2025-05-07T20:33:22.5526172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5526306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5526381Z ) 2025-05-07T20:33:22.5526456Z else: 2025-05-07T20:33:22.5526553Z scale_ub_tensor = None 2025-05-07T20:33:22.5526625Z 2025-05-07T20:33:22.5526753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5526845Z op = silu_mul_quant 2025-05-07T20:33:22.5526930Z if compiled: 2025-05-07T20:33:22.5527032Z op = torch.compile(op) 2025-05-07T20:33:22.5527136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5527210Z 2025-05-07T20:33:22.5527307Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5527312Z 2025-05-07T20:33:22.5527448Z moe/activation_test.py:117: 2025-05-07T20:33:22.5527578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5527687Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5527784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5528279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5528378Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5528733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5528954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5529291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5529390Z kernel = self.compile( 2025-05-07T20:33:22.5529775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5529950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5530082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5530086Z 2025-05-07T20:33:22.5530288Z self = 2025-05-07T20:33:22.5531061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5531567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cef8c4a0>} 2025-05-07T20:33:22.5532318Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5532513Z context = 2025-05-07T20:33:22.5532518Z 2025-05-07T20:33:22.5532681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5532944Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5533056Z module_map=module_map) 2025-05-07T20:33:22.5533216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5533317Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5533395Z E ^ 2025-05-07T20:33:22.5533751Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5533835Z 2025-05-07T20:33:22.5534251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5534256Z 2025-05-07T20:33:22.5534394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5534619Z self=, 2025-05-07T20:33:22.5534696Z T=2048, 2025-05-07T20:33:22.5534771Z D=7168, 2025-05-07T20:33:22.5534856Z scale_ub=1200.0, 2025-05-07T20:33:22.5534938Z contiguous=True, 2025-05-07T20:33:22.5535020Z compiled=False, 2025-05-07T20:33:22.5535095Z ) 2025-05-07T20:33:22.5535310Z self = 2025-05-07T20:33:22.5535484Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5535489Z 2025-05-07T20:33:22.5535569Z @given( 2025-05-07T20:33:22.5535686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5535794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5535947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5536064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5536183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5536258Z ) 2025-05-07T20:33:22.5536500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5536601Z def test_silu_mul_quant( 2025-05-07T20:33:22.5536676Z self, 2025-05-07T20:33:22.5536753Z T: int, 2025-05-07T20:33:22.5536832Z D: int, 2025-05-07T20:33:22.5536929Z scale_ub: Optional[float], 2025-05-07T20:33:22.5537017Z contiguous: bool, 2025-05-07T20:33:22.5537105Z compiled: bool, 2025-05-07T20:33:22.5537184Z ) -> None: 2025-05-07T20:33:22.5537280Z torch.manual_seed(2025) 2025-05-07T20:33:22.5537353Z 2025-05-07T20:33:22.5537519Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5539326Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
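The CompilationError above is hardware-dependent: Triton's fp8e4nv maps to the e4m3 format that NVIDIA GPUs only support natively from compute capability 8.9 (Ada/Hopper) onward, while the linux.g5.4xlarge.nvidia.gpu runner for this job carries an A10G (sm_86), where Triton offers only fp8e4b15 and fp8e5. A minimal guard sketch under that assumption; the helper name _supports_fp8e4nv is illustrative, not from the FBGEMM sources:

import unittest

import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) needs an NVIDIA GPU with compute capability >= 8.9;
    # the A10G behind this job reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: stack on the test to skip the fp8 path on older GPUs.
# @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")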
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5539334Z 2025-05-07T20:33:22.5539451Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5539455Z 2025-05-07T20:33:22.5539559Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5539779Z self=, 2025-05-07T20:33:22.5539854Z T=1, 2025-05-07T20:33:22.5539933Z D=5120, 2025-05-07T20:33:22.5540020Z scale_ub=1200.0, 2025-05-07T20:33:22.5540104Z contiguous=True, 2025-05-07T20:33:22.5540192Z compiled=False, 2025-05-07T20:33:22.5540266Z ) 2025-05-07T20:33:22.5540483Z self = 2025-05-07T20:33:22.5540650Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5540654Z 2025-05-07T20:33:22.5540731Z @given( 2025-05-07T20:33:22.5540848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5540946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5541058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5541176Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5541287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5541360Z ) 2025-05-07T20:33:22.5541603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5541746Z def test_silu_mul_quant( 2025-05-07T20:33:22.5541862Z self, 2025-05-07T20:33:22.5541938Z T: int, 2025-05-07T20:33:22.5542018Z D: int, 2025-05-07T20:33:22.5542119Z scale_ub: Optional[float], 2025-05-07T20:33:22.5542248Z contiguous: bool, 2025-05-07T20:33:22.5542333Z compiled: bool, 2025-05-07T20:33:22.5542413Z ) -> None: 2025-05-07T20:33:22.5542506Z torch.manual_seed(2025) 2025-05-07T20:33:22.5542578Z 2025-05-07T20:33:22.5542745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5542820Z 2025-05-07T20:33:22.5542911Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5543037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5543124Z x = x_sign * x_clamp 2025-05-07T20:33:22.5543207Z x0 = x[:, :D] 2025-05-07T20:33:22.5543286Z x1 = x[:, D:] 2025-05-07T20:33:22.5543357Z 2025-05-07T20:33:22.5543441Z if contiguous: 2025-05-07T20:33:22.5543536Z x0 = x0.contiguous() 2025-05-07T20:33:22.5543627Z x1 = x1.contiguous() 2025-05-07T20:33:22.5543742Z 2025-05-07T20:33:22.5543833Z if scale_ub is not None: 2025-05-07T20:33:22.5543937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5544078Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5544154Z ) 2025-05-07T20:33:22.5544230Z else: 2025-05-07T20:33:22.5547451Z scale_ub_tensor = None 2025-05-07T20:33:22.5547536Z 2025-05-07T20:33:22.5547675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5547766Z op = silu_mul_quant 2025-05-07T20:33:22.5547857Z if compiled: 2025-05-07T20:33:22.5547956Z op = torch.compile(op) 2025-05-07T20:33:22.5548060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5548134Z 2025-05-07T20:33:22.5548227Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5548232Z 2025-05-07T20:33:22.5548342Z moe/activation_test.py:117: 2025-05-07T20:33:22.5548475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5548576Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5548682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5549180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5549275Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5549658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5549905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5550245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5550340Z kernel = self.compile( 2025-05-07T20:33:22.5550722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5550899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5551027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5551034Z 2025-05-07T20:33:22.5551238Z self = 2025-05-07T20:33:22.5552019Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5552520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32cef8da80>} 2025-05-07T20:33:22.5553337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5553566Z context = 2025-05-07T20:33:22.5553609Z 2025-05-07T20:33:22.5553775Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5554036Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5554142Z module_map=module_map) 2025-05-07T20:33:22.5554306Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5554405Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5554482Z E ^ 2025-05-07T20:33:22.5554838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5554842Z 2025-05-07T20:33:22.5555254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5555261Z 2025-05-07T20:33:22.5555410Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5555635Z self=, 2025-05-07T20:33:22.5555771Z T=2048, 2025-05-07T20:33:22.5555855Z D=5120, 2025-05-07T20:33:22.5555937Z scale_ub=None, 2025-05-07T20:33:22.5556022Z contiguous=True, 2025-05-07T20:33:22.5556106Z compiled=False, 2025-05-07T20:33:22.5556180Z ) 2025-05-07T20:33:22.5556402Z self = 2025-05-07T20:33:22.5556572Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5556577Z 2025-05-07T20:33:22.5556655Z @given( 2025-05-07T20:33:22.5556775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5556874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5556990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5557119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5557234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5557307Z ) 2025-05-07T20:33:22.5557554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5557650Z def test_silu_mul_quant( 2025-05-07T20:33:22.5557732Z self, 2025-05-07T20:33:22.5557809Z T: int, 2025-05-07T20:33:22.5557886Z D: int, 2025-05-07T20:33:22.5557988Z scale_ub: Optional[float], 2025-05-07T20:33:22.5558076Z contiguous: bool, 2025-05-07T20:33:22.5558160Z compiled: bool, 2025-05-07T20:33:22.5558242Z ) -> None: 2025-05-07T20:33:22.5558336Z torch.manual_seed(2025) 2025-05-07T20:33:22.5558411Z 2025-05-07T20:33:22.5558582Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5558658Z 2025-05-07T20:33:22.5558749Z > x_sign = torch.sign(x) 2025-05-07T20:33:22.5560547Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5560555Z 2025-05-07T20:33:22.5560672Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:22.5560680Z 2025-05-07T20:33:22.5560782Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5561003Z self=, 2025-05-07T20:33:22.5561085Z T=16384, 2025-05-07T20:33:22.5561161Z D=5120, 2025-05-07T20:33:22.5561244Z scale_ub=None, 2025-05-07T20:33:22.5561451Z contiguous=True, 2025-05-07T20:33:22.5561537Z compiled=False, 2025-05-07T20:33:22.5561612Z ) 2025-05-07T20:33:22.5561831Z self = 2025-05-07T20:33:22.5562047Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5562052Z 2025-05-07T20:33:22.5562136Z @given( 2025-05-07T20:33:22.5562253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5562350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5562467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5562581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5562694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5562771Z ) 2025-05-07T20:33:22.5563014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5563107Z def test_silu_mul_quant( 2025-05-07T20:33:22.5563191Z self, 2025-05-07T20:33:22.5563267Z T: int, 2025-05-07T20:33:22.5563385Z D: int, 2025-05-07T20:33:22.5563488Z scale_ub: Optional[float], 2025-05-07T20:33:22.5563577Z contiguous: bool, 2025-05-07T20:33:22.5563671Z compiled: bool, 2025-05-07T20:33:22.5563750Z ) -> None: 2025-05-07T20:33:22.5563844Z torch.manual_seed(2025) 2025-05-07T20:33:22.5563920Z 2025-05-07T20:33:22.5564088Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5566238Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
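The requested sizes track the test's input shape directly: x is [T, 2 * D] in bfloat16, i.e. T * 2D * 2 bytes per allocation, and torch.sign(x) materializes a second tensor of the same size. A quick check of the two failures above under that reading:

# 16384 x (2 * 5120) bf16 elements for torch.randn -> 320 MiB, as reported.
assert 16384 * (2 * 5120) * 2 == 320 * 1024 * 1024
# 2048 x (2 * 5120) bf16 elements for the torch.sign temporary -> 40 MiB.
assert 2048 * (2 * 5120) * 2 == 40 * 1024 * 1024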
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5566255Z 2025-05-07T20:33:22.5566376Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5566383Z 2025-05-07T20:33:22.5566485Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5566714Z self=, 2025-05-07T20:33:22.5566793Z T=4096, 2025-05-07T20:33:22.5566871Z D=5120, 2025-05-07T20:33:22.5566957Z scale_ub=None, 2025-05-07T20:33:22.5567045Z contiguous=True, 2025-05-07T20:33:22.5567132Z compiled=False, 2025-05-07T20:33:22.5567205Z ) 2025-05-07T20:33:22.5567420Z self = 2025-05-07T20:33:22.5567594Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5567599Z 2025-05-07T20:33:22.5567676Z @given( 2025-05-07T20:33:22.5567796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5567901Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5568019Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5568135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5568252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5568330Z ) 2025-05-07T20:33:22.5568579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5568672Z def test_silu_mul_quant( 2025-05-07T20:33:22.5568750Z self, 2025-05-07T20:33:22.5568832Z T: int, 2025-05-07T20:33:22.5568909Z D: int, 2025-05-07T20:33:22.5569009Z scale_ub: Optional[float], 2025-05-07T20:33:22.5569102Z contiguous: bool, 2025-05-07T20:33:22.5569187Z compiled: bool, 2025-05-07T20:33:22.5569266Z ) -> None: 2025-05-07T20:33:22.5569366Z torch.manual_seed(2025) 2025-05-07T20:33:22.5569439Z 2025-05-07T20:33:22.5569714Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5571552Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5571614Z 2025-05-07T20:33:22.5571731Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5571740Z 2025-05-07T20:33:22.5571840Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5572061Z self=, 2025-05-07T20:33:22.5572142Z T=2048, 2025-05-07T20:33:22.5572225Z D=5120, 2025-05-07T20:33:22.5572307Z scale_ub=None, 2025-05-07T20:33:22.5572450Z contiguous=False, 2025-05-07T20:33:22.5572535Z compiled=False, 2025-05-07T20:33:22.5572608Z ) 2025-05-07T20:33:22.5572828Z self = 2025-05-07T20:33:22.5573002Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.5573007Z 2025-05-07T20:33:22.5573089Z @given( 2025-05-07T20:33:22.5573206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5573309Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5573426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5573540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5573653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5573731Z ) 2025-05-07T20:33:22.5573971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5574070Z def test_silu_mul_quant( 2025-05-07T20:33:22.5574152Z self, 2025-05-07T20:33:22.5574235Z T: int, 2025-05-07T20:33:22.5574312Z D: int, 2025-05-07T20:33:22.5574414Z scale_ub: Optional[float], 2025-05-07T20:33:22.5574502Z contiguous: bool, 2025-05-07T20:33:22.5574596Z compiled: bool, 2025-05-07T20:33:22.5574675Z ) -> None: 2025-05-07T20:33:22.5574769Z torch.manual_seed(2025) 2025-05-07T20:33:22.5574847Z 2025-05-07T20:33:22.5575012Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5576796Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5576808Z 2025-05-07T20:33:22.5576926Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5576931Z 2025-05-07T20:33:22.5577031Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5577254Z self=, 2025-05-07T20:33:22.5577331Z T=4096, 2025-05-07T20:33:22.5577408Z D=7168, 2025-05-07T20:33:22.5577493Z scale_ub=None, 2025-05-07T20:33:22.5577577Z contiguous=True, 2025-05-07T20:33:22.5577667Z compiled=True, 2025-05-07T20:33:22.5577742Z ) 2025-05-07T20:33:22.5577959Z self = 2025-05-07T20:33:22.5578128Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.5578133Z 2025-05-07T20:33:22.5578210Z @given( 2025-05-07T20:33:22.5578414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5578520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5578635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5578789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5578905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5578979Z ) 2025-05-07T20:33:22.5579225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5579319Z def test_silu_mul_quant( 2025-05-07T20:33:22.5579395Z self, 2025-05-07T20:33:22.5579478Z T: int, 2025-05-07T20:33:22.5579554Z D: int, 2025-05-07T20:33:22.5579651Z scale_ub: Optional[float], 2025-05-07T20:33:22.5579744Z contiguous: bool, 2025-05-07T20:33:22.5579830Z compiled: bool, 2025-05-07T20:33:22.5579908Z ) -> None: 2025-05-07T20:33:22.5580005Z torch.manual_seed(2025) 2025-05-07T20:33:22.5580085Z 2025-05-07T20:33:22.5580290Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5582075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5582084Z 2025-05-07T20:33:22.5582199Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5582206Z 2025-05-07T20:33:22.5582308Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5582532Z self=, 2025-05-07T20:33:22.5582617Z T=2048, 2025-05-07T20:33:22.5582695Z D=5120, 2025-05-07T20:33:22.5582780Z scale_ub=1200.0, 2025-05-07T20:33:22.5582869Z contiguous=False, 2025-05-07T20:33:22.5582954Z compiled=False, 2025-05-07T20:33:22.5583027Z ) 2025-05-07T20:33:22.5583247Z self = 2025-05-07T20:33:22.5583420Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5583424Z 2025-05-07T20:33:22.5583506Z @given( 2025-05-07T20:33:22.5583625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5583723Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5583840Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5583955Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5584068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5584145Z ) 2025-05-07T20:33:22.5584392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5584491Z def test_silu_mul_quant( 2025-05-07T20:33:22.5584573Z self, 2025-05-07T20:33:22.5584650Z T: int, 2025-05-07T20:33:22.5584729Z D: int, 2025-05-07T20:33:22.5584828Z scale_ub: Optional[float], 2025-05-07T20:33:22.5584916Z contiguous: bool, 2025-05-07T20:33:22.5585005Z compiled: bool, 2025-05-07T20:33:22.5585085Z ) -> None: 2025-05-07T20:33:22.5585181Z torch.manual_seed(2025) 2025-05-07T20:33:22.5585259Z 2025-05-07T20:33:22.5585425Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5587248Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5587328Z 2025-05-07T20:33:22.5587445Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5587449Z 2025-05-07T20:33:22.5587553Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5587775Z self=, 2025-05-07T20:33:22.5587853Z T=4096, 2025-05-07T20:33:22.5587931Z D=7168, 2025-05-07T20:33:22.5588020Z scale_ub=1200.0, 2025-05-07T20:33:22.5588104Z contiguous=True, 2025-05-07T20:33:22.5588191Z compiled=False, 2025-05-07T20:33:22.5588263Z ) 2025-05-07T20:33:22.5588478Z self = 2025-05-07T20:33:22.5588655Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5588662Z 2025-05-07T20:33:22.5588742Z @given( 2025-05-07T20:33:22.5588924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5589024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5589147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5589262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5589375Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5589454Z ) 2025-05-07T20:33:22.5589696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5589792Z def test_silu_mul_quant( 2025-05-07T20:33:22.5589868Z self, 2025-05-07T20:33:22.5589945Z T: int, 2025-05-07T20:33:22.5590024Z D: int, 2025-05-07T20:33:22.5590121Z scale_ub: Optional[float], 2025-05-07T20:33:22.5590209Z contiguous: bool, 2025-05-07T20:33:22.5590298Z compiled: bool, 2025-05-07T20:33:22.5590379Z ) -> None: 2025-05-07T20:33:22.5590475Z torch.manual_seed(2025) 2025-05-07T20:33:22.5590553Z 2025-05-07T20:33:22.5590720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5592507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5592513Z 2025-05-07T20:33:22.5592628Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5592632Z 2025-05-07T20:33:22.5592735Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5592961Z self=, 2025-05-07T20:33:22.5593042Z T=16384, 2025-05-07T20:33:22.5593124Z D=7168, 2025-05-07T20:33:22.5593205Z scale_ub=None, 2025-05-07T20:33:22.5593292Z contiguous=False, 2025-05-07T20:33:22.5593376Z compiled=True, 2025-05-07T20:33:22.5593448Z ) 2025-05-07T20:33:22.5593662Z self = 2025-05-07T20:33:22.5593839Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:22.5593843Z 2025-05-07T20:33:22.5593921Z @given( 2025-05-07T20:33:22.5594038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5594138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5594251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5594370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5594482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5594638Z ) 2025-05-07T20:33:22.5594886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5594978Z def test_silu_mul_quant( 2025-05-07T20:33:22.5595099Z self, 2025-05-07T20:33:22.5595178Z T: int, 2025-05-07T20:33:22.5595256Z D: int, 2025-05-07T20:33:22.5595352Z scale_ub: Optional[float], 2025-05-07T20:33:22.5595443Z contiguous: bool, 2025-05-07T20:33:22.5595527Z compiled: bool, 2025-05-07T20:33:22.5595608Z ) -> None: 2025-05-07T20:33:22.5595701Z torch.manual_seed(2025) 2025-05-07T20:33:22.5595827Z 2025-05-07T20:33:22.5595998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5597825Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
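Each message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but that hint only addresses fragmentation (reserved-but-unallocated memory, which is tens of MiB here), so it is unlikely to recover a process already holding ~22 GiB. For completeness, a sketch of applying the hint; it must take effect before the first CUDA allocation:

import os

# Set before torch touches the GPU; ideally before importing torch at all.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402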
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5597836Z 2025-05-07T20:33:22.5597954Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5597959Z 2025-05-07T20:33:22.5598058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5598278Z self=, 2025-05-07T20:33:22.5598356Z T=4096, 2025-05-07T20:33:22.5598432Z D=7168, 2025-05-07T20:33:22.5598514Z scale_ub=None, 2025-05-07T20:33:22.5598600Z contiguous=True, 2025-05-07T20:33:22.5598683Z compiled=False, 2025-05-07T20:33:22.5598757Z ) 2025-05-07T20:33:22.5598975Z self = 2025-05-07T20:33:22.5599149Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5599155Z 2025-05-07T20:33:22.5599235Z @given( 2025-05-07T20:33:22.5599352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5599453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5599570Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5599684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5599795Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5599870Z ) 2025-05-07T20:33:22.5600111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5600208Z def test_silu_mul_quant( 2025-05-07T20:33:22.5600285Z self, 2025-05-07T20:33:22.5600361Z T: int, 2025-05-07T20:33:22.5600440Z D: int, 2025-05-07T20:33:22.5600536Z scale_ub: Optional[float], 2025-05-07T20:33:22.5600625Z contiguous: bool, 2025-05-07T20:33:22.5600718Z compiled: bool, 2025-05-07T20:33:22.5600795Z ) -> None: 2025-05-07T20:33:22.5600891Z torch.manual_seed(2025) 2025-05-07T20:33:22.5600968Z 2025-05-07T20:33:22.5601132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5602917Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5604812Z 2025-05-07T20:33:22.5604931Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5605190Z 2025-05-07T20:33:22.5605331Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5605749Z self=, 2025-05-07T20:33:22.5606155Z T=16384, 2025-05-07T20:33:22.5606387Z D=7168, 2025-05-07T20:33:22.5606582Z scale_ub=None, 2025-05-07T20:33:22.5606800Z contiguous=True, 2025-05-07T20:33:22.5607022Z compiled=False, 2025-05-07T20:33:22.5607222Z ) 2025-05-07T20:33:22.5607542Z self = 2025-05-07T20:33:22.5608039Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:22.5608315Z 2025-05-07T20:33:22.5608407Z @given( 2025-05-07T20:33:22.5608643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5608957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5609264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5609591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5609923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5610255Z ) 2025-05-07T20:33:22.5610603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5611043Z def test_silu_mul_quant( 2025-05-07T20:33:22.5611285Z self, 2025-05-07T20:33:22.5611479Z T: int, 2025-05-07T20:33:22.5611670Z D: int, 2025-05-07T20:33:22.5611885Z scale_ub: Optional[float], 2025-05-07T20:33:22.5612155Z contiguous: bool, 2025-05-07T20:33:22.5612387Z compiled: bool, 2025-05-07T20:33:22.5612608Z ) -> None: 2025-05-07T20:33:22.5612819Z torch.manual_seed(2025) 2025-05-07T20:33:22.5613054Z 2025-05-07T20:33:22.5613326Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5615381Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5617253Z 2025-05-07T20:33:22.5617370Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5617581Z 2025-05-07T20:33:22.5617686Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5618094Z self=, 2025-05-07T20:33:22.5618497Z T=16384, 2025-05-07T20:33:22.5618689Z D=7168, 2025-05-07T20:33:22.5618877Z scale_ub=1200.0, 2025-05-07T20:33:22.5619098Z contiguous=True, 2025-05-07T20:33:22.5619317Z compiled=False, 2025-05-07T20:33:22.5619520Z ) 2025-05-07T20:33:22.5619838Z self = 2025-05-07T20:33:22.5620336Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5620614Z 2025-05-07T20:33:22.5620702Z @given( 2025-05-07T20:33:22.5620924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5621239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5621544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5621870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5622200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5622485Z ) 2025-05-07T20:33:22.5622838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5623274Z def test_silu_mul_quant( 2025-05-07T20:33:22.5623515Z self, 2025-05-07T20:33:22.5623707Z T: int, 2025-05-07T20:33:22.5623898Z D: int, 2025-05-07T20:33:22.5624171Z scale_ub: Optional[float], 2025-05-07T20:33:22.5624477Z contiguous: bool, 2025-05-07T20:33:22.5624714Z compiled: bool, 2025-05-07T20:33:22.5624936Z ) -> None: 2025-05-07T20:33:22.5625149Z torch.manual_seed(2025) 2025-05-07T20:33:22.5625426Z 2025-05-07T20:33:22.5625695Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5627748Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
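The failing requests are small (40-448 MiB) against the 21.7 GiB PyTorch already holds, so the likelier culprit is memory accumulating across Hypothesis examples rather than any single oversized case. A cleanup sketch that could run between examples; the helper name is illustrative, not part of the test file:

import gc

import torch

def _release_cuda_memory() -> None:
    # Drop dead Python references first, then return the allocator's
    # cached blocks so the next example starts from an empty pool.
    gc.collect()
    torch.cuda.empty_cache()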
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5629614Z 2025-05-07T20:33:22.5629737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5629947Z 2025-05-07T20:33:22.5630093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5630502Z self=, 2025-05-07T20:33:22.5630908Z T=128, 2025-05-07T20:33:22.5631096Z D=5120, 2025-05-07T20:33:22.5631283Z scale_ub=1200.0, 2025-05-07T20:33:22.5631504Z contiguous=False, 2025-05-07T20:33:22.5631728Z compiled=False, 2025-05-07T20:33:22.5631928Z ) 2025-05-07T20:33:22.5632244Z self = 2025-05-07T20:33:22.5632735Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:22.5633008Z 2025-05-07T20:33:22.5633091Z @given( 2025-05-07T20:33:22.5633313Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5633625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5633932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5634258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5634588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5634871Z ) 2025-05-07T20:33:22.5635215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5635654Z def test_silu_mul_quant( 2025-05-07T20:33:22.5635943Z self, 2025-05-07T20:33:22.5636136Z T: int, 2025-05-07T20:33:22.5636333Z D: int, 2025-05-07T20:33:22.5636549Z scale_ub: Optional[float], 2025-05-07T20:33:22.5636814Z contiguous: bool, 2025-05-07T20:33:22.5637052Z compiled: bool, 2025-05-07T20:33:22.5637273Z ) -> None: 2025-05-07T20:33:22.5637485Z torch.manual_seed(2025) 2025-05-07T20:33:22.5637723Z 2025-05-07T20:33:22.5637993Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5638334Z 2025-05-07T20:33:22.5638523Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5638819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5639135Z x = x_sign * x_clamp 2025-05-07T20:33:22.5639373Z x0 = x[:, :D] 2025-05-07T20:33:22.5639589Z x1 = x[:, D:] 2025-05-07T20:33:22.5639795Z 2025-05-07T20:33:22.5639975Z if contiguous: 2025-05-07T20:33:22.5640209Z x0 = x0.contiguous() 2025-05-07T20:33:22.5640464Z x1 = x1.contiguous() 2025-05-07T20:33:22.5640701Z 2025-05-07T20:33:22.5640895Z if scale_ub is not None: 2025-05-07T20:33:22.5641169Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5641497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5641810Z ) 2025-05-07T20:33:22.5642000Z else: 2025-05-07T20:33:22.5642207Z scale_ub_tensor = None 2025-05-07T20:33:22.5642456Z 2025-05-07T20:33:22.5642685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5643050Z op = silu_mul_quant 2025-05-07T20:33:22.5643334Z if compiled: 2025-05-07T20:33:22.5643581Z op = torch.compile(op) 2025-05-07T20:33:22.5643877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5644213Z 2025-05-07T20:33:22.5644410Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5644573Z 2025-05-07T20:33:22.5644674Z moe/activation_test.py:117: 2025-05-07T20:33:22.5644964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5645298Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5645578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5646265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5646954Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5647490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5648179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5648888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5649426Z kernel = self.compile( 2025-05-07T20:33:22.5649963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5650616Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5651008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5651244Z 2025-05-07T20:33:22.5651451Z self = 2025-05-07T20:33:22.5652538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5653914Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32ced287c0>} 2025-05-07T20:33:22.5655253Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5656277Z context = 2025-05-07T20:33:22.5656567Z 2025-05-07T20:33:22.5656733Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5657256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5657721Z module_map=module_map) 2025-05-07T20:33:22.5658090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5658451Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5658710Z E ^ 2025-05-07T20:33:22.5659179Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5659661Z 2025-05-07T20:33:22.5660099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5660609Z 2025-05-07T20:33:22.5660717Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5661127Z self=, 2025-05-07T20:33:22.5661530Z T=2048, 2025-05-07T20:33:22.5661718Z D=7168, 2025-05-07T20:33:22.5661908Z scale_ub=None, 2025-05-07T20:33:22.5662118Z contiguous=False, 2025-05-07T20:33:22.5662342Z compiled=False, 2025-05-07T20:33:22.5662543Z ) 2025-05-07T20:33:22.5662856Z self = 2025-05-07T20:33:22.5663408Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.5663720Z 2025-05-07T20:33:22.5663805Z @given( 2025-05-07T20:33:22.5664032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5664387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5664694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5665020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5665614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5665948Z ) 2025-05-07T20:33:22.5666299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5666737Z def test_silu_mul_quant( 2025-05-07T20:33:22.5666978Z self, 2025-05-07T20:33:22.5667170Z T: int, 2025-05-07T20:33:22.5667363Z D: int, 2025-05-07T20:33:22.5667579Z scale_ub: Optional[float], 2025-05-07T20:33:22.5667847Z contiguous: bool, 2025-05-07T20:33:22.5668080Z compiled: bool, 2025-05-07T20:33:22.5668308Z ) -> None: 2025-05-07T20:33:22.5668603Z torch.manual_seed(2025) 2025-05-07T20:33:22.5668841Z 2025-05-07T20:33:22.5669111Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5671216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
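With derandomize=True and print_blob=True active (the rerun's session header later in this log reports both), every printed example is reproducible; one way to pin the case that just failed while debugging is Hypothesis's stock example decorator, stacked above the existing @given, sketched here:

from hypothesis import example

# Pins the T=2048, D=7168 OOM case printed above so it always runs;
# apply directly above @given on test_silu_mul_quant.
pin_failing_case = example(
    T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False
)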
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5673079Z 2025-05-07T20:33:22.5673198Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5673410Z 2025-05-07T20:33:22.5673520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5673932Z self=, 2025-05-07T20:33:22.5674333Z T=128, 2025-05-07T20:33:22.5674522Z D=7168, 2025-05-07T20:33:22.5674712Z scale_ub=1200.0, 2025-05-07T20:33:22.5674934Z contiguous=True, 2025-05-07T20:33:22.5675152Z compiled=True, 2025-05-07T20:33:22.5675350Z ) 2025-05-07T20:33:22.5675666Z self = 2025-05-07T20:33:22.5676200Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5676468Z 2025-05-07T20:33:22.5676550Z @given( 2025-05-07T20:33:22.5676778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5677086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5677389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5677711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5678039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5678329Z ) 2025-05-07T20:33:22.5678676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5679121Z def test_silu_mul_quant( 2025-05-07T20:33:22.5679361Z self, 2025-05-07T20:33:22.5679548Z T: int, 2025-05-07T20:33:22.5679745Z D: int, 2025-05-07T20:33:22.5679964Z scale_ub: Optional[float], 2025-05-07T20:33:22.5680227Z contiguous: bool, 2025-05-07T20:33:22.5680463Z compiled: bool, 2025-05-07T20:33:22.5680681Z ) -> None: 2025-05-07T20:33:22.5680894Z torch.manual_seed(2025) 2025-05-07T20:33:22.5681126Z 2025-05-07T20:33:22.5681396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5686249Z 2025-05-07T20:33:22.5686479Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5686773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5687091Z x = x_sign * x_clamp 2025-05-07T20:33:22.5687503Z x0 = x[:, :D] 2025-05-07T20:33:22.5687725Z x1 = x[:, D:] 2025-05-07T20:33:22.5687932Z 2025-05-07T20:33:22.5688120Z if contiguous: 2025-05-07T20:33:22.5688411Z x0 = x0.contiguous() 2025-05-07T20:33:22.5688673Z x1 = x1.contiguous() 2025-05-07T20:33:22.5688917Z 2025-05-07T20:33:22.5689116Z if scale_ub is not None: 2025-05-07T20:33:22.5689384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.5689719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.5690031Z ) 2025-05-07T20:33:22.5690219Z else: 2025-05-07T20:33:22.5690435Z scale_ub_tensor = None 2025-05-07T20:33:22.5690687Z 2025-05-07T20:33:22.5690921Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.5691237Z op = silu_mul_quant 2025-05-07T20:33:22.5691485Z if compiled: 2025-05-07T20:33:22.5691727Z op = torch.compile(op) 2025-05-07T20:33:22.5692029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5692353Z 2025-05-07T20:33:22.5692549Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.5692718Z 2025-05-07T20:33:22.5692822Z moe/activation_test.py:117: 2025-05-07T20:33:22.5693120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5693454Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.5693732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.5694295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:22.5694858Z return fn(*args, **kwargs) 
2025-05-07T20:33:22.5695514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.5696195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.5696732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.5697416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.5698076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.5698610Z kernel = self.compile( 2025-05-07T20:33:22.5699148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.5699825Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.5700250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.5700486Z 2025-05-07T20:33:22.5700694Z self = 2025-05-07T20:33:22.5701782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.5703158Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32ced29940>} 2025-05-07T20:33:22.5704495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.5705523Z context = 2025-05-07T20:33:22.5705814Z 2025-05-07T20:33:22.5705979Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.5706502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.5706963Z module_map=module_map) 2025-05-07T20:33:22.5707381Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.5707774Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.5708030Z E ^ 2025-05-07T20:33:22.5708496Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.5708992Z 2025-05-07T20:33:22.5709406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.5709953Z 2025-05-07T20:33:22.5710070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5710479Z self=, 2025-05-07T20:33:22.5710882Z T=128, 2025-05-07T20:33:22.5711073Z D=7168, 2025-05-07T20:33:22.5711265Z scale_ub=1200.0, 2025-05-07T20:33:22.5711492Z contiguous=True, 2025-05-07T20:33:22.5711716Z compiled=False, 2025-05-07T20:33:22.5711917Z ) 2025-05-07T20:33:22.5712240Z self = 2025-05-07T20:33:22.5712786Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.5713057Z 2025-05-07T20:33:22.5713137Z @given( 2025-05-07T20:33:22.5713366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5713683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5713989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5714318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5714643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5714928Z ) 2025-05-07T20:33:22.5715269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5715782Z def test_silu_mul_quant( 2025-05-07T20:33:22.5716024Z self, 2025-05-07T20:33:22.5716212Z T: int, 2025-05-07T20:33:22.5716412Z D: int, 2025-05-07T20:33:22.5716627Z scale_ub: Optional[float], 2025-05-07T20:33:22.5716905Z contiguous: bool, 2025-05-07T20:33:22.5717139Z compiled: bool, 2025-05-07T20:33:22.5717362Z ) -> None: 2025-05-07T20:33:22.5717572Z torch.manual_seed(2025) 2025-05-07T20:33:22.5717806Z 2025-05-07T20:33:22.5718079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5718421Z 2025-05-07T20:33:22.5718609Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5718904Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5720916Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5722776Z 2025-05-07T20:33:22.5722897Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:22.5723108Z 2025-05-07T20:33:22.5723217Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5723621Z self=, 2025-05-07T20:33:22.5724026Z T=128, 2025-05-07T20:33:22.5724214Z D=5120, 2025-05-07T20:33:22.5724404Z scale_ub=1200.0, 2025-05-07T20:33:22.5724628Z contiguous=True, 2025-05-07T20:33:22.5724845Z compiled=True, 2025-05-07T20:33:22.5725041Z ) 2025-05-07T20:33:22.5725358Z self = 2025-05-07T20:33:22.5725844Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.5726112Z 2025-05-07T20:33:22.5726190Z @given( 2025-05-07T20:33:22.5726417Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5726821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5727130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5727452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5727818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5728104Z ) 2025-05-07T20:33:22.5728446Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5728891Z def test_silu_mul_quant( 2025-05-07T20:33:22.5729136Z self, 2025-05-07T20:33:22.5729322Z T: int, 2025-05-07T20:33:22.5729518Z D: int, 2025-05-07T20:33:22.5729739Z scale_ub: Optional[float], 2025-05-07T20:33:22.5730006Z contiguous: bool, 2025-05-07T20:33:22.5730243Z compiled: bool, 2025-05-07T20:33:22.5730464Z ) -> None: 2025-05-07T20:33:22.5730674Z torch.manual_seed(2025) 2025-05-07T20:33:22.5730917Z 2025-05-07T20:33:22.5731189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5731530Z 2025-05-07T20:33:22.5731765Z x_sign = torch.sign(x) 2025-05-07T20:33:22.5732052Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.5734057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
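Note how the failure point creeps forward as free memory shrinks: earlier examples die in torch.randn (activation_test.py:92), then torch.sign (:94), and now torch.clamp (:95). The two figures each message quotes come straight from the caching allocator and can be polled between examples to confirm the growth; a small sketch using standard torch.cuda calls:

import torch

def report_cuda_memory(tag: str) -> None:
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    # The OOM text's "reserved but unallocated" is reserved - allocated.
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")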
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5735913Z 2025-05-07T20:33:22.5736030Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:22.5736241Z 2025-05-07T20:33:22.5736350Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.5736764Z self=, 2025-05-07T20:33:22.5737166Z T=128, 2025-05-07T20:33:22.5737355Z D=7168, 2025-05-07T20:33:22.5737544Z scale_ub=None, 2025-05-07T20:33:22.5737760Z contiguous=True, 2025-05-07T20:33:22.5737978Z compiled=True, 2025-05-07T20:33:22.5738172Z ) 2025-05-07T20:33:22.5738486Z self = 2025-05-07T20:33:22.5738969Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.5739231Z 2025-05-07T20:33:22.5739310Z @given( 2025-05-07T20:33:22.5739535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.5739875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.5740200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.5740525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.5740848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.5741138Z ) 2025-05-07T20:33:22.5741483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.5741920Z def test_silu_mul_quant( 2025-05-07T20:33:22.5742162Z self, 2025-05-07T20:33:22.5742350Z T: int, 2025-05-07T20:33:22.5742542Z D: int, 2025-05-07T20:33:22.5742755Z scale_ub: Optional[float], 2025-05-07T20:33:22.5743020Z contiguous: bool, 2025-05-07T20:33:22.5743255Z compiled: bool, 2025-05-07T20:33:22.5743474Z ) -> None: 2025-05-07T20:33:22.5743687Z torch.manual_seed(2025) 2025-05-07T20:33:22.5743921Z 2025-05-07T20:33:22.5744189Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.5746280Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:22.5748229Z 2025-05-07T20:33:22.5748351Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:22.5748689Z =============================== warnings summary =============================== 2025-05-07T20:33:22.5749233Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:22.5749953Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:22.5750666Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:22.5751974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:22.5753171Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:22.5753501Z 2025-05-07T20:33:22.5753708Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:22.5754177Z ================= 1 failed, 1 deselected, 3 warnings in 13.14s ================= 2025-05-07T20:33:24.1453583Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:24.2071552Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:24.2071781Z 2025-05-07T20:33:26.2089524Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:28.3661002Z ============================= test session starts ============================== 2025-05-07T20:33:28.3661666Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:28.3662190Z cachedir: .pytest_cache 2025-05-07T20:33:28.3662882Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:28.3664339Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:28.3665140Z plugins: hypothesis-6.131.14 2025-05-07T20:33:29.9804655Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:30.0883969Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:30.0884387Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:30.0884623Z 2025-05-07T20:33:32.4543815Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:32.4545300Z self=, 2025-05-07T20:33:32.4545731Z T=1, 2025-05-07T20:33:32.4545918Z D=5120, 2025-05-07T20:33:32.4546117Z scale_ub=None, 2025-05-07T20:33:32.4546333Z contiguous=True, 2025-05-07T20:33:32.4546552Z compiled=True, 2025-05-07T20:33:32.4546765Z ) 2025-05-07T20:33:32.4547093Z self = 2025-05-07T20:33:32.4547584Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:32.4547846Z 2025-05-07T20:33:32.4547926Z @given( 2025-05-07T20:33:32.4548161Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:32.4548478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:32.4548780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:32.4549557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:32.4549898Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:32.4550183Z ) 2025-05-07T20:33:32.4550656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:32.4551111Z def test_silu_mul_quant( 2025-05-07T20:33:32.4551359Z self, 2025-05-07T20:33:32.4551557Z T: int, 2025-05-07T20:33:32.4551761Z D: int, 2025-05-07T20:33:32.4551981Z scale_ub: Optional[float], 2025-05-07T20:33:32.4552252Z contiguous: bool, 2025-05-07T20:33:32.4552501Z compiled: bool, 2025-05-07T20:33:32.4552736Z ) -> None: 2025-05-07T20:33:32.4552954Z torch.manual_seed(2025) 2025-05-07T20:33:32.4553202Z 2025-05-07T20:33:32.4553480Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:32.4553828Z 2025-05-07T20:33:32.4554029Z x_sign = torch.sign(x) 2025-05-07T20:33:32.4554331Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:32.4554740Z x = x_sign * x_clamp 2025-05-07T20:33:32.4554996Z x0 = x[:, :D] 2025-05-07T20:33:32.4555220Z x1 = x[:, D:] 2025-05-07T20:33:32.4555431Z 2025-05-07T20:33:32.4555626Z if contiguous: 2025-05-07T20:33:32.4555993Z x0 = x0.contiguous() 2025-05-07T20:33:32.4556253Z x1 = x1.contiguous() 2025-05-07T20:33:32.4556500Z 2025-05-07T20:33:32.4556703Z if scale_ub is not None: 2025-05-07T20:33:32.4556982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:32.4557320Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:32.4557640Z ) 2025-05-07T20:33:32.4557839Z else: 2025-05-07T20:33:32.4558051Z scale_ub_tensor = None 2025-05-07T20:33:32.4558310Z 2025-05-07T20:33:32.4558553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.4558870Z op = silu_mul_quant 2025-05-07T20:33:32.4559131Z if compiled: 2025-05-07T20:33:32.4559387Z op = torch.compile(op) 2025-05-07T20:33:32.4559684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:32.4559966Z 2025-05-07T20:33:32.4560164Z y_fp8, y_scale = fn() 2025-05-07T20:33:32.4560449Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:32.4560746Z 2025-05-07T20:33:32.4560992Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:32.4561326Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:32.4561623Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:32.4561942Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:32.4562309Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.4562621Z 2025-05-07T20:33:32.4562829Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:32.4563025Z 2025-05-07T20:33:32.4563133Z moe/activation_test.py:126: 2025-05-07T20:33:32.4563437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:32.4563781Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:32.4564116Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:32.4564915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:32.4565967Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:32.4566518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:32.4567206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:32.4567893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:32.4568621Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:32.4569546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:32.4570191Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:32.4570850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:32.4571373Z fn() 2025-05-07T20:33:32.4571887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:32.4572473Z self.fn.run( 2025-05-07T20:33:32.4572940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:32.4573478Z kernel = self.compile( 2025-05-07T20:33:32.4574022Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:32.4574681Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:32.4575151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:32.4575386Z
2025-05-07T20:33:32.4575606Z self =
2025-05-07T20:33:32.4576701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:32.4578096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb2735c60>}
2025-05-07T20:33:32.4579445Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:32.4580485Z context =
2025-05-07T20:33:32.4580777Z
2025-05-07T20:33:32.4580961Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:32.4581488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:32.4581968Z module_map=module_map)
2025-05-07T20:33:32.4582338Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:32.4582703Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:32.4582971Z E ^
2025-05-07T20:33:32.4583442Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.4583896Z
2025-05-07T20:33:32.4584318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[Every remaining Hypothesis trial of test_silu_mul_quant fails with this same CompilationError. The per-trial test source and Triton traceback repeat verbatim; only the drawn parameters and the first fp8 kernel reached differ, so the repeats are condensed into the list after the sketch below. The pattern is consistent: trials with compiled=False fail inside fn(), when the eager _fbgemm_silu_mul_quant Triton kernel is compiled, while trials with compiled=True get through fn() (the torch.compile path evidently does not hit the eager Triton compile) and fail later in ref_fn(), when triton_quantize_fp8_row compiles _kernel_quantize_fp8_row.]
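[Why the architecture complaint: fp8e4nv is Triton's name for the float8 E4M3 format. The snippet below is a hedged sanity check, not part of the test suite; the (8, 9) threshold is an assumption that Triton's CUDA backend only enables fp8e4nv from compute capability 8.9 (Ada/Hopper) upward, which would exclude the A10G (sm_86) backing this linux.g5.4xlarge runner. The error text itself only guarantees that this GPU gets 'fp8e4b15' and 'fp8e5'.]

    import torch

    # Sketch: report what the device offers. The (8, 9) cutoff is an
    # assumption about when Triton's CUDA backend enables fp8e4nv (E4M3).
    major, minor = torch.cuda.get_device_capability()
    print(torch.cuda.get_device_name(0))              # expected here: "NVIDIA A10G"
    print(f"compute capability: sm_{major}{minor}")   # expected here: sm_86
    print("fp8e4nv (E4M3) expected:", (major, minor) >= (8, 9))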
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.7283352Z 2025-05-07T20:33:34.7283766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.7284282Z 2025-05-07T20:33:34.7284388Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.7284807Z self=, 2025-05-07T20:33:34.7285226Z T=4096, 2025-05-07T20:33:34.7285430Z D=7168, 2025-05-07T20:33:34.7291499Z scale_ub=None, 2025-05-07T20:33:34.7291742Z contiguous=False, 2025-05-07T20:33:34.7291986Z compiled=False, 2025-05-07T20:33:34.7292203Z ) 2025-05-07T20:33:34.7292535Z self = 2025-05-07T20:33:34.7293068Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:34.7293351Z 2025-05-07T20:33:34.7293435Z @given( 2025-05-07T20:33:34.7293682Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.7294004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.7294314Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.7294654Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.7294992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.7295281Z ) 2025-05-07T20:33:34.7295644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.7296095Z def test_silu_mul_quant( 2025-05-07T20:33:34.7296343Z self, 2025-05-07T20:33:34.7296543Z T: int, 2025-05-07T20:33:34.7296754Z D: int, 2025-05-07T20:33:34.7296975Z scale_ub: Optional[float], 2025-05-07T20:33:34.7297257Z contiguous: bool, 2025-05-07T20:33:34.7297509Z compiled: bool, 2025-05-07T20:33:34.7297736Z ) -> None: 2025-05-07T20:33:34.7297962Z torch.manual_seed(2025) 2025-05-07T20:33:34.7298214Z 2025-05-07T20:33:34.7298488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.7298842Z 2025-05-07T20:33:34.7299045Z x_sign = torch.sign(x) 2025-05-07T20:33:34.7299333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.7299648Z x = x_sign * x_clamp 2025-05-07T20:33:34.7299900Z x0 = x[:, :D] 2025-05-07T20:33:34.7300119Z x1 = x[:, D:] 2025-05-07T20:33:34.7300337Z 2025-05-07T20:33:34.7300534Z if contiguous: 2025-05-07T20:33:34.7300767Z x0 = x0.contiguous() 2025-05-07T20:33:34.7301038Z x1 = x1.contiguous() 2025-05-07T20:33:34.7301406Z 2025-05-07T20:33:34.7301610Z if scale_ub is not None: 2025-05-07T20:33:34.7301896Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.7302239Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.7302600Z ) 2025-05-07T20:33:34.7302797Z else: 2025-05-07T20:33:34.7303019Z scale_ub_tensor = None 2025-05-07T20:33:34.7303278Z 2025-05-07T20:33:34.7303511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.7303838Z op = silu_mul_quant 2025-05-07T20:33:34.7304098Z if compiled: 2025-05-07T20:33:34.7304345Z op = torch.compile(op) 2025-05-07T20:33:34.7304648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.7304931Z 2025-05-07T20:33:34.7305126Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.7305300Z 2025-05-07T20:33:34.7305407Z moe/activation_test.py:117: 2025-05-07T20:33:34.7305721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7306110Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.7306391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.7307085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.7307785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.7308320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.7309006Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.7309678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.7310217Z kernel = self.compile( 2025-05-07T20:33:34.7310761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.7311425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.7311831Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7312067Z 2025-05-07T20:33:34.7312284Z self = 2025-05-07T20:33:34.7313378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.7314766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0aee0c0>} 2025-05-07T20:33:34.7316195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.7317243Z context = 2025-05-07T20:33:34.7317531Z 2025-05-07T20:33:34.7317701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.7318231Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.7318720Z module_map=module_map) 2025-05-07T20:33:34.7319094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.7319458Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.7319734Z E ^ 2025-05-07T20:33:34.7320205Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.7320662Z 2025-05-07T20:33:34.7321077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.7321597Z 2025-05-07T20:33:34.7321794Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.7322220Z self=, 2025-05-07T20:33:34.7322639Z T=128, 2025-05-07T20:33:34.7322872Z D=7168, 2025-05-07T20:33:34.7323077Z scale_ub=None, 2025-05-07T20:33:34.7323305Z contiguous=False, 2025-05-07T20:33:34.7323533Z compiled=True, 2025-05-07T20:33:34.7323746Z ) 2025-05-07T20:33:34.7872006Z self = 2025-05-07T20:33:34.7873066Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:34.7873610Z 2025-05-07T20:33:34.7873767Z @given( 2025-05-07T20:33:34.7874234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.7874869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.7875491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.7876111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.7876464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.7876764Z ) 2025-05-07T20:33:34.7877218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.7877678Z def test_silu_mul_quant( 2025-05-07T20:33:34.7877938Z self, 2025-05-07T20:33:34.7878141Z T: int, 2025-05-07T20:33:34.7878356Z D: int, 2025-05-07T20:33:34.7878585Z scale_ub: Optional[float], 2025-05-07T20:33:34.7878865Z contiguous: bool, 2025-05-07T20:33:34.7879120Z compiled: bool, 2025-05-07T20:33:34.7879357Z ) -> None: 2025-05-07T20:33:34.7879580Z torch.manual_seed(2025) 2025-05-07T20:33:34.7879836Z 2025-05-07T20:33:34.7880115Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.7880476Z 2025-05-07T20:33:34.7880679Z x_sign = torch.sign(x) 2025-05-07T20:33:34.7880988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.7881315Z x = x_sign * x_clamp 2025-05-07T20:33:34.7881560Z x0 = x[:, :D] 2025-05-07T20:33:34.7881795Z x1 = x[:, D:] 2025-05-07T20:33:34.7882019Z 2025-05-07T20:33:34.7882206Z if contiguous: 2025-05-07T20:33:34.7882454Z x0 = x0.contiguous() 2025-05-07T20:33:34.7882719Z x1 = x1.contiguous() 2025-05-07T20:33:34.7882964Z 2025-05-07T20:33:34.7883159Z if scale_ub is not None: 2025-05-07T20:33:34.7883436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.7883770Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.7884082Z ) 2025-05-07T20:33:34.7884281Z else: 2025-05-07T20:33:34.7884494Z scale_ub_tensor = None 2025-05-07T20:33:34.7884748Z 2025-05-07T20:33:34.7884982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.7885303Z op = silu_mul_quant 2025-05-07T20:33:34.7885553Z if compiled: 2025-05-07T20:33:34.7885812Z op = torch.compile(op) 2025-05-07T20:33:34.7886117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.7886396Z 2025-05-07T20:33:34.7886597Z y_fp8, y_scale = fn() 2025-05-07T20:33:34.7886891Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:34.7887185Z 2025-05-07T20:33:34.7887424Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.7887768Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:34.7888070Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:34.7888387Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:34.7888742Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:34.7889055Z 2025-05-07T20:33:34.7889266Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:34.7889460Z 2025-05-07T20:33:34.7889564Z moe/activation_test.py:126: 2025-05-07T20:33:34.7889860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7890330Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:34.7890663Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:34.7891446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:34.7892256Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:34.7892809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.7893492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.7894178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:34.7894907Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:34.7895651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:34.7896375Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:34.7896983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:34.7897506Z fn() 2025-05-07T20:33:34.7898016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:34.7898595Z self.fn.run( 2025-05-07T20:33:34.7899071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.7899605Z kernel = self.compile( 2025-05-07T20:33:34.7900140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.7900793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.7901201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.7901435Z 2025-05-07T20:33:34.7901651Z self = 2025-05-07T20:33:34.7902734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.7904110Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0aada80>} 2025-05-07T20:33:34.7905452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.7906531Z context = 2025-05-07T20:33:34.7906819Z 2025-05-07T20:33:34.7907001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.7907525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.7908005Z module_map=module_map) 2025-05-07T20:33:34.7908376Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.7908738Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:34.7909019Z E ^ 2025-05-07T20:33:34.7909491Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.7909943Z 2025-05-07T20:33:34.7910368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.7910875Z 2025-05-07T20:33:34.7910983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.7911402Z self=, 2025-05-07T20:33:34.7911897Z T=128, 2025-05-07T20:33:34.7912088Z D=7168, 2025-05-07T20:33:34.7912292Z scale_ub=None, 2025-05-07T20:33:34.7912512Z contiguous=False, 2025-05-07T20:33:34.7912742Z compiled=False, 2025-05-07T20:33:34.7913025Z ) 2025-05-07T20:33:34.9878527Z self = 2025-05-07T20:33:34.9879135Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:34.9879413Z 2025-05-07T20:33:34.9879502Z @given( 2025-05-07T20:33:34.9879736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.9880059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.9880371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.9880709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.9881037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.9881328Z ) 2025-05-07T20:33:34.9881685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.9882243Z def test_silu_mul_quant( 2025-05-07T20:33:34.9882495Z self, 2025-05-07T20:33:34.9882692Z T: int, 2025-05-07T20:33:34.9882891Z D: int, 2025-05-07T20:33:34.9883114Z scale_ub: Optional[float], 2025-05-07T20:33:34.9883389Z contiguous: bool, 2025-05-07T20:33:34.9883627Z compiled: bool, 2025-05-07T20:33:34.9883855Z ) -> None: 2025-05-07T20:33:34.9884076Z torch.manual_seed(2025) 2025-05-07T20:33:34.9884315Z 2025-05-07T20:33:34.9884599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.9884950Z 2025-05-07T20:33:34.9885143Z x_sign = torch.sign(x) 2025-05-07T20:33:34.9885435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.9885752Z x = x_sign * x_clamp 2025-05-07T20:33:34.9885996Z x0 = x[:, :D] 2025-05-07T20:33:34.9886231Z x1 = x[:, D:] 2025-05-07T20:33:34.9886473Z 2025-05-07T20:33:34.9886669Z if contiguous: 2025-05-07T20:33:34.9886905Z x0 = x0.contiguous() 2025-05-07T20:33:34.9887170Z x1 = x1.contiguous() 2025-05-07T20:33:34.9887413Z 2025-05-07T20:33:34.9887608Z if scale_ub is not None: 2025-05-07T20:33:34.9887883Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.9888226Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.9888534Z ) 2025-05-07T20:33:34.9888733Z else: 2025-05-07T20:33:34.9888957Z scale_ub_tensor = None 2025-05-07T20:33:34.9889209Z 2025-05-07T20:33:34.9889449Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.9889774Z op = silu_mul_quant 2025-05-07T20:33:34.9890028Z if compiled: 2025-05-07T20:33:34.9890278Z op = torch.compile(op) 2025-05-07T20:33:34.9890581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9890865Z 2025-05-07T20:33:34.9891063Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.9891236Z 2025-05-07T20:33:34.9891344Z moe/activation_test.py:117: 2025-05-07T20:33:34.9891646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9891980Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.9892264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9892954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.9893640Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.9894180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.9894864Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.9895528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.9896134Z kernel = self.compile( 2025-05-07T20:33:34.9896779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.9897439Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.9897901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9898132Z 2025-05-07T20:33:34.9898339Z self = 2025-05-07T20:33:34.9899421Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.9900796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb05e0>} 2025-05-07T20:33:34.9902185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.9903218Z context = 2025-05-07T20:33:34.9903511Z 2025-05-07T20:33:34.9903678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.9904203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.9904675Z module_map=module_map) 2025-05-07T20:33:34.9905035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.9905394Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.9905661Z E ^ 2025-05-07T20:33:34.9906124Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.9906579Z 2025-05-07T20:33:34.9906998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.9907514Z 2025-05-07T20:33:34.9907620Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.9908048Z self=, 2025-05-07T20:33:34.9908454Z T=4096, 2025-05-07T20:33:34.9908665Z D=5120, 2025-05-07T20:33:34.9908863Z scale_ub=1200.0, 2025-05-07T20:33:34.9909089Z contiguous=True, 2025-05-07T20:33:34.9909311Z compiled=False, 2025-05-07T20:33:34.9909525Z ) 2025-05-07T20:33:34.9909853Z self = 2025-05-07T20:33:34.9910349Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:34.9910633Z 2025-05-07T20:33:34.9910713Z @given( 2025-05-07T20:33:34.9910951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.9911274Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.9911582Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.9911921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.9912255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.9912545Z ) 2025-05-07T20:33:34.9912899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.9913346Z def test_silu_mul_quant( 2025-05-07T20:33:34.9913587Z self, 2025-05-07T20:33:34.9913785Z T: int, 2025-05-07T20:33:34.9913987Z D: int, 2025-05-07T20:33:34.9914211Z scale_ub: Optional[float], 2025-05-07T20:33:34.9914487Z contiguous: bool, 2025-05-07T20:33:34.9914728Z compiled: bool, 2025-05-07T20:33:34.9914959Z ) -> None: 2025-05-07T20:33:34.9915185Z torch.manual_seed(2025) 2025-05-07T20:33:34.9915439Z 2025-05-07T20:33:34.9915800Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.9916236Z 2025-05-07T20:33:34.9916439Z x_sign = torch.sign(x) 2025-05-07T20:33:34.9916736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.9917044Z x = x_sign * x_clamp 2025-05-07T20:33:34.9917336Z x0 = x[:, :D] 2025-05-07T20:33:34.9917555Z x1 = x[:, D:] 2025-05-07T20:33:34.9917761Z 2025-05-07T20:33:34.9917953Z if contiguous: 2025-05-07T20:33:34.9918193Z x0 = x0.contiguous() 2025-05-07T20:33:34.9918454Z x1 = x1.contiguous() 2025-05-07T20:33:34.9918701Z 2025-05-07T20:33:34.9918903Z if scale_ub is not None: 2025-05-07T20:33:34.9919173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.9919509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.9919821Z ) 2025-05-07T20:33:34.9920014Z else: 2025-05-07T20:33:34.9920229Z scale_ub_tensor = None 2025-05-07T20:33:34.9920483Z 2025-05-07T20:33:34.9920722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.9921041Z op = silu_mul_quant 2025-05-07T20:33:34.9921335Z if compiled: 2025-05-07T20:33:34.9921587Z op = torch.compile(op) 2025-05-07T20:33:34.9921888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9922164Z 2025-05-07T20:33:34.9922361Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.9922527Z 2025-05-07T20:33:34.9922626Z moe/activation_test.py:117: 2025-05-07T20:33:34.9922924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9923263Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.9923542Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.9924233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.9924921Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.9925463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.9926145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.9926810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.9927344Z kernel = self.compile( 2025-05-07T20:33:34.9927878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.9928530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.9928929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.9929160Z 2025-05-07T20:33:34.9929370Z self = 2025-05-07T20:33:34.9930450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.9931828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb1b20>} 2025-05-07T20:33:34.9933175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.9934198Z context = 2025-05-07T20:33:34.9934483Z 2025-05-07T20:33:34.9934652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.9935169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.9935638Z module_map=module_map) 2025-05-07T20:33:34.9936053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.9936442Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.9936701Z E ^ 2025-05-07T20:33:34.9937167Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test source identical to the previous example; this time fn() succeeds and the
     reference path fails instead]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb2a20>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

The next three examples fail identically in ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row:

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
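Every failure above shares one root cause: Triton refuses to lower the fp8e4nv (float8 e4m3) dtype on this GPU. That matches the runner hardware: fp8e4nv codegen requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on a g5 instance is SM 8.6, which is why only fp8e4b15 and fp8e5 are offered. A minimal guard along these lines (the helper name is illustrative, not an FBGEMM API) would let such tests skip rather than error on pre-8.9 machines:

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption from the error text: fp8e4nv (e4m3) needs SM 8.9+; the A10G
        # (SM 8.6) on this runner only exposes fp8e4b15/fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. with unittest:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+ hardware")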
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:33:36.625000 96975 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

self = 
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test source identical; fails again at the reference path]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same traceback as above through _kernel_quantize_fp8_row]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

self = 
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test source identical; with the compiled op the failure surfaces at fn() instead]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[same Triton compile chain as above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)

self = 
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    [test source identical; fails at the reference path]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
[same traceback as above through _kernel_quantize_fp8_row]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
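Separately from the FP8 failures, the recompile_limit warning above shows torch.compile rebuilding silu_mul_quant once per shape/stride combination until Dynamo gives up after 8 recompiles; the stride 5120-vs-10240 mismatch is the contiguous copy versus the strided view of x. A sketch of two ways a sweep like this one could avoid hitting the limit (the config knob name is taken from the warning text itself; silu_mul here is a stand-in function, not the FBGEMM op):

    import torch
    import torch._dynamo

    # Option 1: raise the per-function recompile budget for the sweep.
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile with dynamic shapes so varying sizes/strides do not
    # each force a fresh graph.
    def silu_mul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return a * torch.sigmoid(a) * b

    compiled_silu_mul = torch.compile(silu_mul, dynamic=True)

    a = torch.randn(128, 5120)
    b = torch.randn(128, 5120)
    out = compiled_silu_mul(a, b)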
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.0849053Z 2025-05-07T20:33:37.0849472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.0849989Z 2025-05-07T20:33:37.0850098Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.0850517Z self=, 2025-05-07T20:33:37.0850927Z T=1, 2025-05-07T20:33:37.0851123Z D=5120, 2025-05-07T20:33:37.0851329Z scale_ub=None, 2025-05-07T20:33:37.0851543Z contiguous=True, 2025-05-07T20:33:37.0851771Z compiled=False, 2025-05-07T20:33:37.0851984Z ) 2025-05-07T20:33:37.2358053Z self = 2025-05-07T20:33:37.2358742Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2359134Z 2025-05-07T20:33:37.2359250Z @given( 2025-05-07T20:33:37.2359568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2360002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2360447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2360842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2361185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2361476Z ) 2025-05-07T20:33:37.2361891Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2362343Z def test_silu_mul_quant( 2025-05-07T20:33:37.2362595Z self, 2025-05-07T20:33:37.2362799Z T: int, 2025-05-07T20:33:37.2363004Z D: int, 2025-05-07T20:33:37.2363217Z scale_ub: Optional[float], 2025-05-07T20:33:37.2363495Z contiguous: bool, 2025-05-07T20:33:37.2363740Z compiled: bool, 2025-05-07T20:33:37.2363973Z ) -> None: 2025-05-07T20:33:37.2364203Z torch.manual_seed(2025) 2025-05-07T20:33:37.2364461Z 2025-05-07T20:33:37.2364738Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2365077Z 2025-05-07T20:33:37.2365275Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2365827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2366216Z x = x_sign * x_clamp 2025-05-07T20:33:37.2366462Z x0 = x[:, :D] 2025-05-07T20:33:37.2366681Z x1 = x[:, D:] 2025-05-07T20:33:37.2366889Z 2025-05-07T20:33:37.2367076Z if contiguous: 2025-05-07T20:33:37.2367320Z x0 = x0.contiguous() 2025-05-07T20:33:37.2367627Z x1 = x1.contiguous() 2025-05-07T20:33:37.2367875Z 2025-05-07T20:33:37.2368073Z if scale_ub is not None: 2025-05-07T20:33:37.2368341Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2368674Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2369078Z ) 2025-05-07T20:33:37.2369354Z else: 2025-05-07T20:33:37.2369645Z scale_ub_tensor = None 2025-05-07T20:33:37.2369990Z 2025-05-07T20:33:37.2370258Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2370572Z op = silu_mul_quant 2025-05-07T20:33:37.2370830Z if compiled: 2025-05-07T20:33:37.2371083Z op = torch.compile(op) 2025-05-07T20:33:37.2371374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2371656Z 2025-05-07T20:33:37.2371850Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2372014Z 2025-05-07T20:33:37.2372112Z moe/activation_test.py:117: 2025-05-07T20:33:37.2372410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2372746Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2373022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2373707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2374398Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2374932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2375615Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2376278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2376813Z kernel = self.compile( 2025-05-07T20:33:37.2377351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2378007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2378408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2378639Z 2025-05-07T20:33:37.2378855Z self = 2025-05-07T20:33:37.2380022Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2381448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbe5f6a0>} 2025-05-07T20:33:37.2382859Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2383886Z context = 2025-05-07T20:33:37.2384173Z 2025-05-07T20:33:37.2384346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2384865Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2385335Z module_map=module_map) 2025-05-07T20:33:37.2385700Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2386064Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2386331Z E ^ 2025-05-07T20:33:37.2386842Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2387297Z 2025-05-07T20:33:37.2387722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2388231Z 2025-05-07T20:33:37.2388339Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2388754Z self=, 2025-05-07T20:33:37.2389161Z T=128, 2025-05-07T20:33:37.2389359Z D=5120, 2025-05-07T20:33:37.2389552Z scale_ub=None, 2025-05-07T20:33:37.2389769Z contiguous=False, 2025-05-07T20:33:37.2389994Z compiled=True, 2025-05-07T20:33:37.2390198Z ) 2025-05-07T20:33:37.2390522Z self = 2025-05-07T20:33:37.2391045Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.2391315Z 2025-05-07T20:33:37.2391401Z @given( 2025-05-07T20:33:37.2391638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2397848Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2398196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2398540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2398880Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2399168Z ) 2025-05-07T20:33:37.2399521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2399964Z def test_silu_mul_quant( 2025-05-07T20:33:37.2400208Z self, 2025-05-07T20:33:37.2400413Z T: int, 2025-05-07T20:33:37.2400617Z D: int, 2025-05-07T20:33:37.2400849Z scale_ub: Optional[float], 2025-05-07T20:33:37.2401127Z contiguous: bool, 2025-05-07T20:33:37.2401376Z compiled: bool, 2025-05-07T20:33:37.2401615Z ) -> None: 2025-05-07T20:33:37.2401840Z torch.manual_seed(2025) 2025-05-07T20:33:37.2402090Z 2025-05-07T20:33:37.2402369Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2402725Z 2025-05-07T20:33:37.2402920Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2403208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2403524Z x = x_sign * x_clamp 2025-05-07T20:33:37.2403772Z x0 = x[:, :D] 2025-05-07T20:33:37.2403991Z x1 = x[:, D:] 2025-05-07T20:33:37.2404199Z 2025-05-07T20:33:37.2404388Z if contiguous: 2025-05-07T20:33:37.2404621Z x0 = x0.contiguous() 2025-05-07T20:33:37.2404889Z x1 = x1.contiguous() 2025-05-07T20:33:37.2405143Z 2025-05-07T20:33:37.2405336Z if scale_ub is not None: 2025-05-07T20:33:37.2405624Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2406042Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2406392Z ) 2025-05-07T20:33:37.2406587Z else: 2025-05-07T20:33:37.2406802Z scale_ub_tensor = None 2025-05-07T20:33:37.2407055Z 2025-05-07T20:33:37.2407389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2407716Z op = silu_mul_quant 2025-05-07T20:33:37.2407975Z if compiled: 2025-05-07T20:33:37.2408221Z op = torch.compile(op) 2025-05-07T20:33:37.2408518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2408794Z 2025-05-07T20:33:37.2408987Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2409156Z 2025-05-07T20:33:37.2409258Z moe/activation_test.py:117: 2025-05-07T20:33:37.2409568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2409905Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2410184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2410748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.2411358Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tries further examples; each fails with the identical CompilationError in _fbgemm_silu_mul_quant (test body and traceback as above):

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
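Every example so far dies at Triton's fp8e4nv check, which is what this error looks like on GPUs that predate hardware e4m3 support (Triton only offers fp8e4b15 and fp8e5 there). A minimal sketch of a guard one could use to skip these tests on such GPUs follows; it is not part of the FBGEMM test suite, and the sm_89 (Ada) threshold is an assumption on my part, not something this log confirms:

# Hypothetical helper, not from moe/activation_test.py: skip FP8 tests on
# GPUs whose architecture Triton's fp8e4nv (e4m3) type does not support.
import unittest

import torch

def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # Assumption: fp8e4nv needs compute capability 8.9 (Ada) or newer.
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage on the failing test method:
# @unittest.skipUnless(_supports_fp8e4nv(), "GPU lacks fp8e4nv support")
# def test_silu_mul_quant(self, ...) -> None: ...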
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    (test body identical to the first example above)
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd4c20>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
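In this one example the op under test returned, and the failure moved into the reference path: triton_quantize_fp8_row also materializes fp8e4nv, so its _kernel_quantize_fp8_row fails the same architecture check. For readers without the FBGEMM source at hand, a rough pure-PyTorch emulation of what that rowwise quantization computes; the scale_ub clamping, the e4m3 max of 448, and overflow handling are assumptions here, not FBGEMM's actual kernel:

# Sketch of rowwise FP8 quantization as exercised by ref_fn(); an emulation
# under stated assumptions, not fbgemm_gpu's triton_quantize_fp8_row itself.
import torch

FP8_E4M3_MAX = 448.0  # assumed max finite value of float8_e4m3fn

def quantize_fp8_row_emulated(y: torch.Tensor, scale_ub=None):
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-elem tensor
    row_max = torch.clamp(row_max, min=1e-12)       # guard all-zero rows
    scale = FP8_E4M3_MAX / row_max                  # per-row quantization scale
    y_fp8 = (y.float() * scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantize with: y_fp8.float() * y_scale[:, None]
    return y_fp8, 1.0 / scale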
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.7704433Z 2025-05-07T20:33:37.7704859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.7705367Z 2025-05-07T20:33:37.7705475Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.7705935Z self=, 2025-05-07T20:33:37.7706376Z T=1, 2025-05-07T20:33:37.7706558Z D=5120, 2025-05-07T20:33:37.7706754Z scale_ub=1200.0, 2025-05-07T20:33:37.7706979Z contiguous=False, 2025-05-07T20:33:37.7707244Z compiled=True, 2025-05-07T20:33:37.7707443Z ) 2025-05-07T20:33:37.9259530Z self = 2025-05-07T20:33:37.9260962Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:37.9261695Z 2025-05-07T20:33:37.9261918Z @given( 2025-05-07T20:33:37.9262429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.9263051Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.9263662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.9264322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.9264968Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.9265981Z ) 2025-05-07T20:33:37.9266706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.9267955Z def test_silu_mul_quant( 2025-05-07T20:33:37.9268242Z self, 2025-05-07T20:33:37.9268457Z T: int, 2025-05-07T20:33:37.9268654Z D: int, 2025-05-07T20:33:37.9268870Z scale_ub: Optional[float], 2025-05-07T20:33:37.9269143Z contiguous: bool, 2025-05-07T20:33:37.9269393Z compiled: bool, 2025-05-07T20:33:37.9269613Z ) -> None: 2025-05-07T20:33:37.9269836Z torch.manual_seed(2025) 2025-05-07T20:33:37.9270089Z 2025-05-07T20:33:37.9270357Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.9270707Z 2025-05-07T20:33:37.9270915Z x_sign = torch.sign(x) 2025-05-07T20:33:37.9271208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.9271523Z x = x_sign * x_clamp 2025-05-07T20:33:37.9271771Z x0 = x[:, :D] 2025-05-07T20:33:37.9271986Z x1 = x[:, D:] 2025-05-07T20:33:37.9272205Z 2025-05-07T20:33:37.9272399Z if contiguous: 2025-05-07T20:33:37.9272638Z x0 = x0.contiguous() 2025-05-07T20:33:37.9272902Z x1 = x1.contiguous() 2025-05-07T20:33:37.9273145Z 2025-05-07T20:33:37.9273336Z if scale_ub is not None: 2025-05-07T20:33:37.9273611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.9273948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.9274258Z ) 2025-05-07T20:33:37.9274448Z else: 2025-05-07T20:33:37.9274662Z scale_ub_tensor = None 2025-05-07T20:33:37.9274913Z 2025-05-07T20:33:37.9275142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.9275456Z op = silu_mul_quant 2025-05-07T20:33:37.9275790Z if compiled: 2025-05-07T20:33:37.9276037Z op = torch.compile(op) 2025-05-07T20:33:37.9276334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9276619Z 2025-05-07T20:33:37.9276810Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.9276983Z 2025-05-07T20:33:37.9277082Z moe/activation_test.py:117: 2025-05-07T20:33:37.9277387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9277727Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.9278010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9278568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.9279128Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.9279781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.9280468Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.9281004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.9281755Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.9282470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.9283063Z kernel = self.compile( 2025-05-07T20:33:37.9283602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.9284251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.9284651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9284887Z 2025-05-07T20:33:37.9285095Z self = 2025-05-07T20:33:37.9286181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.9287607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd5ee0>} 2025-05-07T20:33:37.9289006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.9290030Z context = 2025-05-07T20:33:37.9290318Z 2025-05-07T20:33:37.9290488Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.9291013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.9291487Z module_map=module_map) 2025-05-07T20:33:37.9291850Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.9292209Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.9292471Z E ^ 2025-05-07T20:33:37.9292937Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.9293392Z 2025-05-07T20:33:37.9293810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.9294317Z 2025-05-07T20:33:37.9294426Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.9294845Z self=, 2025-05-07T20:33:37.9295257Z T=1, 2025-05-07T20:33:37.9295443Z D=5120, 2025-05-07T20:33:37.9295637Z scale_ub=1200.0, 2025-05-07T20:33:37.9295871Z contiguous=False, 2025-05-07T20:33:37.9296098Z compiled=False, 2025-05-07T20:33:37.9296302Z ) 2025-05-07T20:33:37.9296622Z self = 2025-05-07T20:33:37.9297113Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.9297382Z 2025-05-07T20:33:37.9297466Z @given( 2025-05-07T20:33:37.9297699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.9298041Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.9298375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.9298706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.9299041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.9299331Z ) 2025-05-07T20:33:37.9299674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.9300118Z def test_silu_mul_quant( 2025-05-07T20:33:37.9300368Z self, 2025-05-07T20:33:37.9300562Z T: int, 2025-05-07T20:33:37.9300768Z D: int, 2025-05-07T20:33:37.9300994Z scale_ub: Optional[float], 2025-05-07T20:33:37.9301264Z contiguous: bool, 2025-05-07T20:33:37.9301508Z compiled: bool, 2025-05-07T20:33:37.9301819Z ) -> None: 2025-05-07T20:33:37.9302039Z torch.manual_seed(2025) 2025-05-07T20:33:37.9302277Z 2025-05-07T20:33:37.9302551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.9302938Z 2025-05-07T20:33:37.9303129Z x_sign = torch.sign(x) 2025-05-07T20:33:37.9303421Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.9303730Z x = x_sign * x_clamp 2025-05-07T20:33:37.9303968Z x0 = x[:, :D] 2025-05-07T20:33:37.9304185Z x1 = x[:, D:] 2025-05-07T20:33:37.9304394Z 2025-05-07T20:33:37.9304575Z if contiguous: 2025-05-07T20:33:37.9304813Z x0 = x0.contiguous() 2025-05-07T20:33:37.9305073Z x1 = x1.contiguous() 2025-05-07T20:33:37.9305307Z 2025-05-07T20:33:37.9305503Z if scale_ub is not None: 2025-05-07T20:33:37.9305782Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.9306120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.9306446Z ) 2025-05-07T20:33:37.9306644Z else: 2025-05-07T20:33:37.9306904Z scale_ub_tensor = None 2025-05-07T20:33:37.9307163Z 2025-05-07T20:33:37.9307400Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.9307721Z op = silu_mul_quant 2025-05-07T20:33:37.9307975Z if compiled: 2025-05-07T20:33:37.9308221Z op = torch.compile(op) 2025-05-07T20:33:37.9308516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9308792Z 2025-05-07T20:33:37.9308983Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.9309151Z 2025-05-07T20:33:37.9309261Z moe/activation_test.py:117: 2025-05-07T20:33:37.9309555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9309888Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.9310173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.9310861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.9311551Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.9312085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.9312769Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.9313426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.9313957Z kernel = self.compile( 2025-05-07T20:33:37.9314501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.9315153Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.9315545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.9315853Z 2025-05-07T20:33:37.9316062Z self = 2025-05-07T20:33:37.9317150Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.9318574Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd6b60>} 2025-05-07T20:33:37.9319914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.9320941Z context = 2025-05-07T20:33:37.9321233Z 2025-05-07T20:33:37.9321400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.9322048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.9322519Z module_map=module_map) 2025-05-07T20:33:37.9322928Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.9323289Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.9323549Z E ^ 2025-05-07T20:33:37.9324012Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.9324472Z 2025-05-07T20:33:37.9324884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.9325393Z 2025-05-07T20:33:37.9325504Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.9325913Z self=, 2025-05-07T20:33:37.9326319Z T=16384, 2025-05-07T20:33:37.9326521Z D=5120, 2025-05-07T20:33:37.9332876Z scale_ub=1200.0, 2025-05-07T20:33:37.9333120Z contiguous=False, 2025-05-07T20:33:37.9333419Z compiled=True, 2025-05-07T20:33:37.9333636Z ) 2025-05-07T20:33:38.0201265Z self = 2025-05-07T20:33:38.0202050Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:38.0202453Z 2025-05-07T20:33:38.0202572Z @given( 2025-05-07T20:33:38.0202883Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.0203312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.0203624Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.0203959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.0204291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.0204573Z ) 2025-05-07T20:33:38.0204929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.0205384Z def test_silu_mul_quant( 2025-05-07T20:33:38.0205633Z self, 2025-05-07T20:33:38.0205831Z T: int, 2025-05-07T20:33:38.0206034Z D: int, 2025-05-07T20:33:38.0206260Z scale_ub: Optional[float], 2025-05-07T20:33:38.0206536Z contiguous: bool, 2025-05-07T20:33:38.0206777Z compiled: bool, 2025-05-07T20:33:38.0207002Z ) -> None: 2025-05-07T20:33:38.0207220Z torch.manual_seed(2025) 2025-05-07T20:33:38.0207472Z 2025-05-07T20:33:38.0207746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.0208091Z 2025-05-07T20:33:38.0208289Z x_sign = torch.sign(x) 2025-05-07T20:33:38.0208583Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.0208891Z x = x_sign * x_clamp 2025-05-07T20:33:38.0209135Z x0 = x[:, :D] 2025-05-07T20:33:38.0209358Z x1 = x[:, D:] 2025-05-07T20:33:38.0209570Z 2025-05-07T20:33:38.0209758Z if contiguous: 2025-05-07T20:33:38.0210003Z x0 = x0.contiguous() 2025-05-07T20:33:38.0210262Z x1 = x1.contiguous() 2025-05-07T20:33:38.0210510Z 2025-05-07T20:33:38.0210706Z if scale_ub is not None: 2025-05-07T20:33:38.0210984Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.0211318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.0211634Z ) 2025-05-07T20:33:38.0211830Z else: 2025-05-07T20:33:38.0212039Z scale_ub_tensor = None 2025-05-07T20:33:38.0212296Z 2025-05-07T20:33:38.0212536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.0212851Z op = silu_mul_quant 2025-05-07T20:33:38.0213102Z if compiled: 2025-05-07T20:33:38.0213355Z op = torch.compile(op) 2025-05-07T20:33:38.0213649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0213934Z 2025-05-07T20:33:38.0214133Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.0214305Z 2025-05-07T20:33:38.0214584Z moe/activation_test.py:117: 2025-05-07T20:33:38.0214897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.0215234Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.0215577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0216137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.0216710Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.0217379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.0218230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.0218907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.0219769Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.0220515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.0221109Z kernel = self.compile( 2025-05-07T20:33:38.0221657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.0222317Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.0222723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.0222956Z 2025-05-07T20:33:38.0223174Z self = 2025-05-07T20:33:38.0224264Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.0225648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4c8220>} 2025-05-07T20:33:38.0227000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.0228203Z context = 2025-05-07T20:33:38.0228571Z 2025-05-07T20:33:38.0228780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.0229438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.0229956Z module_map=module_map) 2025-05-07T20:33:38.0230318Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.0230677Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.0230941Z E ^ 2025-05-07T20:33:38.0231412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.0231874Z 2025-05-07T20:33:38.0232290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.0232807Z 2025-05-07T20:33:38.0232914Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.0233332Z self=, 2025-05-07T20:33:38.0233734Z T=2048, 2025-05-07T20:33:38.0233928Z D=7168, 2025-05-07T20:33:38.0234122Z scale_ub=1200.0, 2025-05-07T20:33:38.0234372Z contiguous=False, 2025-05-07T20:33:38.0234594Z compiled=True, 2025-05-07T20:33:38.0234803Z ) 2025-05-07T20:33:38.0235127Z self = 2025-05-07T20:33:38.0235619Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:38.0236039Z 2025-05-07T20:33:38.0236119Z @given( 2025-05-07T20:33:38.0236504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.0236824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.0237136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.0237518Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.0237855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.0238143Z ) 2025-05-07T20:33:38.0238495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.0238939Z def test_silu_mul_quant( 2025-05-07T20:33:38.0239179Z self, 2025-05-07T20:33:38.0239380Z T: int, 2025-05-07T20:33:38.0239582Z D: int, 2025-05-07T20:33:38.0239798Z scale_ub: Optional[float], 2025-05-07T20:33:38.0240075Z contiguous: bool, 2025-05-07T20:33:38.0240318Z compiled: bool, 2025-05-07T20:33:38.0240548Z ) -> None: 2025-05-07T20:33:38.0240764Z torch.manual_seed(2025) 2025-05-07T20:33:38.0241008Z 2025-05-07T20:33:38.0241291Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.0241680Z 2025-05-07T20:33:38.0241881Z x_sign = torch.sign(x) 2025-05-07T20:33:38.0242176Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.0242488Z x = x_sign * x_clamp 2025-05-07T20:33:38.0242729Z x0 = x[:, :D] 2025-05-07T20:33:38.0242947Z x1 = x[:, D:] 2025-05-07T20:33:38.0243155Z 2025-05-07T20:33:38.0243345Z if contiguous: 2025-05-07T20:33:38.0243581Z x0 = x0.contiguous() 2025-05-07T20:33:38.0243841Z x1 = x1.contiguous() 2025-05-07T20:33:38.0244083Z 2025-05-07T20:33:38.0244281Z if scale_ub is not None: 2025-05-07T20:33:38.0244552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.0244890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.0245205Z ) 2025-05-07T20:33:38.0245399Z else: 2025-05-07T20:33:38.0245616Z scale_ub_tensor = None 2025-05-07T20:33:38.0245878Z 2025-05-07T20:33:38.0246116Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.0246430Z op = silu_mul_quant 2025-05-07T20:33:38.0246686Z if compiled: 2025-05-07T20:33:38.0246936Z op = torch.compile(op) 2025-05-07T20:33:38.0247230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0247512Z 2025-05-07T20:33:38.0247716Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.0247906Z 2025-05-07T20:33:38.0248031Z moe/activation_test.py:117: 2025-05-07T20:33:38.0248330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.0248665Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.0248944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.0249501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.0250064Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.0248031Z moe/activation_test.py:117: 
2025-05-07T20:33:38.0248330Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:38.0248665Z moe/activation_test.py:115: in fn
2025-05-07T20:33:38.0248944Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:38.0249501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:38.0250064Z     return fn(*args, **kwargs)
2025-05-07T20:33:38.0250729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:38.0251418Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:38.0251962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:38.0252646Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:38.0253311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:38.0253841Z     kernel = self.compile(
2025-05-07T20:33:38.0254382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:38.0255041Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:38.0255486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:38.0255761Z 
2025-05-07T20:33:38.0255973Z self = 
2025-05-07T20:33:38.0257058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:38.0258530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4c8f40>}
2025-05-07T20:33:38.0259876Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:38.0260896Z context = 
2025-05-07T20:33:38.0261190Z 
2025-05-07T20:33:38.0261362Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:38.0261929Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:38.0262417Z                            module_map=module_map)
2025-05-07T20:33:38.0262779Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:38.0263137Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:38.0263408Z E   ^
2025-05-07T20:33:38.0263878Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:38.0264339Z 
2025-05-07T20:33:38.0264751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:38.0265268Z 
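The same compile-time failure reproduces outside the test suite: on a device below compute capability 8.9, any Triton kernel that casts to tl.float8e4nv is rejected while the AST is lowered to TTIR, before anything is launched. A self-contained sketch, with an illustrative kernel that is not part of FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        y = x.to(tl.float8e4nv)  # rejected at make_ir time on pre-sm_89 GPUs
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # On unsupported hardware the launch below raises CompilationError
    # wrapping ValueError("type fp8e4nv not supported in this architecture. ...")
    cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)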
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.0264339Z 2025-05-07T20:33:38.0264751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.0265268Z 2025-05-07T20:33:38.1424228Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1425432Z self=, 2025-05-07T20:33:38.1426624Z T=1, 2025-05-07T20:33:38.1427130Z D=5120, 2025-05-07T20:33:38.1427650Z scale_ub=None, 2025-05-07T20:33:38.1428121Z contiguous=False, 2025-05-07T20:33:38.1428442Z compiled=False, 2025-05-07T20:33:38.1428646Z ) 2025-05-07T20:33:38.1428962Z self = 2025-05-07T20:33:38.1429455Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:38.1429722Z 2025-05-07T20:33:38.1429801Z @given( 2025-05-07T20:33:38.1430040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.1430349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.1430654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.1430987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.1431307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.1431596Z ) 2025-05-07T20:33:38.1431939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.1432378Z def test_silu_mul_quant( 2025-05-07T20:33:38.1432617Z self, 2025-05-07T20:33:38.1432811Z T: int, 2025-05-07T20:33:38.1433013Z D: int, 2025-05-07T20:33:38.1433234Z scale_ub: Optional[float], 2025-05-07T20:33:38.1433505Z contiguous: bool, 2025-05-07T20:33:38.1433751Z compiled: bool, 2025-05-07T20:33:38.1433972Z ) -> None: 2025-05-07T20:33:38.1434186Z torch.manual_seed(2025) 2025-05-07T20:33:38.1434429Z 2025-05-07T20:33:38.1434694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.1435034Z 2025-05-07T20:33:38.1435228Z x_sign = torch.sign(x) 2025-05-07T20:33:38.1435511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.1435919Z x = x_sign * x_clamp 2025-05-07T20:33:38.1436162Z x0 = x[:, :D] 2025-05-07T20:33:38.1436375Z x1 = x[:, D:] 2025-05-07T20:33:38.1436580Z 2025-05-07T20:33:38.1436980Z if contiguous: 2025-05-07T20:33:38.1437206Z x0 = x0.contiguous() 2025-05-07T20:33:38.1437466Z x1 = x1.contiguous() 2025-05-07T20:33:38.1437709Z 2025-05-07T20:33:38.1438002Z if scale_ub is not None: 2025-05-07T20:33:38.1438275Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.1438608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.1438914Z ) 2025-05-07T20:33:38.1439107Z else: 2025-05-07T20:33:38.1439314Z scale_ub_tensor = None 2025-05-07T20:33:38.1439563Z 2025-05-07T20:33:38.1439788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1440099Z op = silu_mul_quant 2025-05-07T20:33:38.1440352Z if compiled: 2025-05-07T20:33:38.1440591Z op = torch.compile(op) 2025-05-07T20:33:38.1440880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1441148Z 2025-05-07T20:33:38.1441342Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.1441509Z 2025-05-07T20:33:38.1441672Z moe/activation_test.py:117: 2025-05-07T20:33:38.1441969Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1442301Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.1442576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1443256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.1443944Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.1444472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.1445151Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.1445837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.1446366Z kernel = self.compile( 2025-05-07T20:33:38.1446907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.1447552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.1447993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1448229Z 2025-05-07T20:33:38.1448444Z self = 2025-05-07T20:33:38.1449524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.1450888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4c9ee0>} 2025-05-07T20:33:38.1452233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.1453254Z context = 2025-05-07T20:33:38.1453541Z 2025-05-07T20:33:38.1453713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.1454225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.1454693Z module_map=module_map) 2025-05-07T20:33:38.1455058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.1455410Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.1455664Z E ^ 2025-05-07T20:33:38.1456121Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.1456573Z 2025-05-07T20:33:38.1457057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.1457603Z 2025-05-07T20:33:38.1457709Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1458161Z self=, 2025-05-07T20:33:38.1458560Z T=4096, 2025-05-07T20:33:38.1458747Z D=7168, 2025-05-07T20:33:38.1458934Z scale_ub=1200.0, 2025-05-07T20:33:38.1459155Z contiguous=False, 2025-05-07T20:33:38.1459375Z compiled=False, 2025-05-07T20:33:38.1459573Z ) 2025-05-07T20:33:38.1459886Z self = 2025-05-07T20:33:38.1460380Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.1460652Z 2025-05-07T20:33:38.1460730Z @given( 2025-05-07T20:33:38.1460961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.1461270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.1461586Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.1461952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.1462283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.1462575Z ) 2025-05-07T20:33:38.1462916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.1463353Z def test_silu_mul_quant( 2025-05-07T20:33:38.1463590Z self, 2025-05-07T20:33:38.1463779Z T: int, 2025-05-07T20:33:38.1463971Z D: int, 2025-05-07T20:33:38.1464187Z scale_ub: Optional[float], 2025-05-07T20:33:38.1464455Z contiguous: bool, 2025-05-07T20:33:38.1464690Z compiled: bool, 2025-05-07T20:33:38.1464908Z ) -> None: 2025-05-07T20:33:38.1465120Z torch.manual_seed(2025) 2025-05-07T20:33:38.1465360Z 2025-05-07T20:33:38.1465930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.1466266Z 2025-05-07T20:33:38.1466464Z x_sign = torch.sign(x) 2025-05-07T20:33:38.1466753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.1467066Z x = x_sign * x_clamp 2025-05-07T20:33:38.1467300Z x0 = x[:, :D] 2025-05-07T20:33:38.1467516Z x1 = x[:, D:] 2025-05-07T20:33:38.1467721Z 2025-05-07T20:33:38.1467904Z if contiguous: 2025-05-07T20:33:38.1468167Z x0 = x0.contiguous() 2025-05-07T20:33:38.1468437Z x1 = x1.contiguous() 2025-05-07T20:33:38.1468673Z 2025-05-07T20:33:38.1468863Z if scale_ub is not None: 2025-05-07T20:33:38.1469135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.1469466Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.1469771Z ) 2025-05-07T20:33:38.1469964Z else: 2025-05-07T20:33:38.1470166Z scale_ub_tensor = None 2025-05-07T20:33:38.1470415Z 2025-05-07T20:33:38.1470650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.1470961Z op = silu_mul_quant 2025-05-07T20:33:38.1471214Z if compiled: 2025-05-07T20:33:38.1471458Z op = torch.compile(op) 2025-05-07T20:33:38.1471750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1472021Z 2025-05-07T20:33:38.1472215Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.1472376Z 2025-05-07T20:33:38.1472479Z moe/activation_test.py:117: 2025-05-07T20:33:38.1472766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1473101Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.1473381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.1474056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:38.1474745Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.1475350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.1476124Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.1476778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.1477368Z kernel = self.compile( 2025-05-07T20:33:38.1477904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.1478552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.1478943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.1479173Z 2025-05-07T20:33:38.1479377Z self = 2025-05-07T20:33:38.1480457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.1481880Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfb4cb420>} 2025-05-07T20:33:38.1483217Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.1484243Z context = 2025-05-07T20:33:38.1484535Z 2025-05-07T20:33:38.1484702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.1485225Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.1485691Z module_map=module_map) 2025-05-07T20:33:38.1486061Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.1486420Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.1486686Z E ^ 2025-05-07T20:33:38.1487154Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.1487608Z 2025-05-07T20:33:38.1488017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.1488524Z 2025-05-07T20:33:38.1488634Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.1489041Z self=, 2025-05-07T20:33:38.1489444Z T=16384, 2025-05-07T20:33:38.1489642Z D=7168, 2025-05-07T20:33:38.1489832Z scale_ub=None, 2025-05-07T20:33:38.1490045Z contiguous=True, 2025-05-07T20:33:38.1490269Z compiled=True, 2025-05-07T20:33:38.1490467Z ) 2025-05-07T20:33:38.3243782Z self = 2025-05-07T20:33:38.3245326Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:38.3246109Z 2025-05-07T20:33:38.3246326Z @given( 2025-05-07T20:33:38.3246933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.3247689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.3248140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.3248506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.3248829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.3249114Z ) 2025-05-07T20:33:38.3249469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.3249908Z def test_silu_mul_quant( 2025-05-07T20:33:38.3250152Z self, 2025-05-07T20:33:38.3250353Z T: int, 2025-05-07T20:33:38.3250547Z D: int, 2025-05-07T20:33:38.3250769Z scale_ub: Optional[float], 2025-05-07T20:33:38.3251040Z contiguous: bool, 2025-05-07T20:33:38.3251495Z compiled: bool, 2025-05-07T20:33:38.3251721Z ) -> None: 2025-05-07T20:33:38.3251941Z torch.manual_seed(2025) 2025-05-07T20:33:38.3252188Z 2025-05-07T20:33:38.3252524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.3252876Z 2025-05-07T20:33:38.3253068Z x_sign = torch.sign(x) 2025-05-07T20:33:38.3253356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.3253682Z x = x_sign * x_clamp 2025-05-07T20:33:38.3253924Z x0 = x[:, :D] 2025-05-07T20:33:38.3254136Z x1 = x[:, D:] 2025-05-07T20:33:38.3254347Z 2025-05-07T20:33:38.3254539Z if contiguous: 2025-05-07T20:33:38.3254775Z x0 = x0.contiguous() 2025-05-07T20:33:38.3255044Z x1 = x1.contiguous() 2025-05-07T20:33:38.3255290Z 2025-05-07T20:33:38.3255481Z if scale_ub is not None: 2025-05-07T20:33:38.3261984Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.3262347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.3262755Z ) 2025-05-07T20:33:38.3262948Z else: 2025-05-07T20:33:38.3263160Z scale_ub_tensor = None 2025-05-07T20:33:38.3263421Z 2025-05-07T20:33:38.3263650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.3263963Z op = silu_mul_quant 2025-05-07T20:33:38.3264221Z if compiled: 2025-05-07T20:33:38.3264466Z op = torch.compile(op) 2025-05-07T20:33:38.3264759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3265042Z 2025-05-07T20:33:38.3265238Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.3265664Z 2025-05-07T20:33:38.3265769Z moe/activation_test.py:117: 2025-05-07T20:33:38.3266080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3266416Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.3266708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3267280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.3267837Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.3268495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.3269182Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.3269724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.3270402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.3271066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.3271607Z kernel = self.compile( 2025-05-07T20:33:38.3272153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.3272806Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.3273366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3273605Z 2025-05-07T20:33:38.3273820Z self = 2025-05-07T20:33:38.3274909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.3276366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5c540>} 2025-05-07T20:33:38.3277801Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.3278896Z context = 2025-05-07T20:33:38.3279185Z 2025-05-07T20:33:38.3279357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.3279937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.3280416Z module_map=module_map) 2025-05-07T20:33:38.3280785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.3281137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.3281400Z E ^ 2025-05-07T20:33:38.3281868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.3282323Z 2025-05-07T20:33:38.3282739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.3283258Z 2025-05-07T20:33:38.3283363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.3283847Z self=, 2025-05-07T20:33:38.3284258Z T=4096, 2025-05-07T20:33:38.3284453Z D=5120, 2025-05-07T20:33:38.3284646Z scale_ub=None, 2025-05-07T20:33:38.3284864Z contiguous=False, 2025-05-07T20:33:38.3285093Z compiled=True, 2025-05-07T20:33:38.3285293Z ) 2025-05-07T20:33:38.3285615Z self = 2025-05-07T20:33:38.3286108Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:38.3286381Z 2025-05-07T20:33:38.3286460Z @given( 2025-05-07T20:33:38.3286695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.3287016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.3287327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.3287663Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.3288003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.3288295Z ) 2025-05-07T20:33:38.3288639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.3289088Z def test_silu_mul_quant( 2025-05-07T20:33:38.3289340Z self, 2025-05-07T20:33:38.3289542Z T: int, 2025-05-07T20:33:38.3289749Z D: int, 2025-05-07T20:33:38.3289973Z scale_ub: Optional[float], 2025-05-07T20:33:38.3290245Z contiguous: bool, 2025-05-07T20:33:38.3290490Z compiled: bool, 2025-05-07T20:33:38.3290715Z ) -> None: 2025-05-07T20:33:38.3290930Z torch.manual_seed(2025) 2025-05-07T20:33:38.3291181Z 2025-05-07T20:33:38.3291459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.3291800Z 2025-05-07T20:33:38.3291999Z x_sign = torch.sign(x) 2025-05-07T20:33:38.3292293Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.3292607Z x = x_sign * x_clamp 2025-05-07T20:33:38.3292852Z x0 = x[:, :D] 2025-05-07T20:33:38.3293073Z x1 = x[:, D:] 2025-05-07T20:33:38.3293280Z 2025-05-07T20:33:38.3293474Z if contiguous: 2025-05-07T20:33:38.3293710Z x0 = x0.contiguous() 2025-05-07T20:33:38.3293973Z x1 = x1.contiguous() 2025-05-07T20:33:38.3294212Z 2025-05-07T20:33:38.3294408Z if scale_ub is not None: 2025-05-07T20:33:38.3294687Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.3295019Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.3295331Z ) 2025-05-07T20:33:38.3295528Z else: 2025-05-07T20:33:38.3295738Z scale_ub_tensor = None 2025-05-07T20:33:38.3295997Z 2025-05-07T20:33:38.3296229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.3296545Z op = silu_mul_quant 2025-05-07T20:33:38.3296800Z if compiled: 2025-05-07T20:33:38.3297147Z op = torch.compile(op) 2025-05-07T20:33:38.3297444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3297721Z 2025-05-07T20:33:38.3297916Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.3298126Z 2025-05-07T20:33:38.3298232Z moe/activation_test.py:117: 2025-05-07T20:33:38.3298525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3298860Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.3299145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.3299698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.3300261Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.3300918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.3301606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.3302186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.3302870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.3303538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.3304069Z kernel = self.compile( 2025-05-07T20:33:38.3304610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.3305265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.3305666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.3305896Z 2025-05-07T20:33:38.3306110Z self = 2025-05-07T20:33:38.3307201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.3308596Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5d260>} 2025-05-07T20:33:38.3309947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.3310973Z context = 2025-05-07T20:33:38.3311268Z 2025-05-07T20:33:38.3311436Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.3311960Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.3312433Z module_map=module_map) 2025-05-07T20:33:38.3312800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.3313161Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.3313426Z E ^ 2025-05-07T20:33:38.3313891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.3314346Z 2025-05-07T20:33:38.3314758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.3315272Z 2025-05-07T20:33:38.6386067Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.6386689Z self=, 2025-05-07T20:33:38.6387298Z T=4096, 2025-05-07T20:33:38.6387550Z D=5120, 2025-05-07T20:33:38.6387813Z scale_ub=1200.0, 2025-05-07T20:33:38.6388149Z contiguous=False, 2025-05-07T20:33:38.6388732Z compiled=False, 2025-05-07T20:33:38.6389141Z ) 2025-05-07T20:33:38.6390054Z self = 2025-05-07T20:33:38.6390985Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.6391493Z 2025-05-07T20:33:38.6391752Z @given( 2025-05-07T20:33:38.6392168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.6392813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.6393434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.6394028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.6394626Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.6395147Z ) 2025-05-07T20:33:38.6395899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.6396706Z def test_silu_mul_quant( 2025-05-07T20:33:38.6397149Z self, 2025-05-07T20:33:38.6397505Z T: int, 2025-05-07T20:33:38.6397853Z D: int, 2025-05-07T20:33:38.6398256Z scale_ub: Optional[float], 2025-05-07T20:33:38.6398580Z contiguous: bool, 2025-05-07T20:33:38.6398883Z compiled: bool, 2025-05-07T20:33:38.6399111Z ) -> None: 2025-05-07T20:33:38.6399333Z torch.manual_seed(2025) 2025-05-07T20:33:38.6399574Z 2025-05-07T20:33:38.6399844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.6400192Z 2025-05-07T20:33:38.6400384Z x_sign = torch.sign(x) 2025-05-07T20:33:38.6400676Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.6400988Z x = x_sign * x_clamp 2025-05-07T20:33:38.6401220Z x0 = x[:, :D] 2025-05-07T20:33:38.6401438Z x1 = x[:, D:] 2025-05-07T20:33:38.6401650Z 2025-05-07T20:33:38.6401832Z if contiguous: 2025-05-07T20:33:38.6402069Z x0 = x0.contiguous() 2025-05-07T20:33:38.6402327Z x1 = x1.contiguous() 2025-05-07T20:33:38.6402562Z 2025-05-07T20:33:38.6402753Z if scale_ub is not None: 2025-05-07T20:33:38.6403033Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.6403374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.6403681Z ) 2025-05-07T20:33:38.6403881Z else: 2025-05-07T20:33:38.6404097Z scale_ub_tensor = None 2025-05-07T20:33:38.6404343Z 2025-05-07T20:33:38.6404570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.6404881Z op = silu_mul_quant 2025-05-07T20:33:38.6405129Z if compiled: 2025-05-07T20:33:38.6405389Z op = torch.compile(op) 2025-05-07T20:33:38.6405694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6405962Z 2025-05-07T20:33:38.6406155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.6406322Z 2025-05-07T20:33:38.6406430Z moe/activation_test.py:117: 2025-05-07T20:33:38.6406730Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6407071Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.6407361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6408060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:38.6408789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.6409322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.6410002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.6410664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.6411186Z kernel = self.compile( 2025-05-07T20:33:38.6411724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.6412376Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.6412891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6413128Z 2025-05-07T20:33:38.6413334Z self = 2025-05-07T20:33:38.6414452Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.6415824Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5e200>} 2025-05-07T20:33:38.6417166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.6418200Z context = 2025-05-07T20:33:38.6418543Z 2025-05-07T20:33:38.6418768Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.6419290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.6419761Z module_map=module_map) 2025-05-07T20:33:38.6420115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.6420477Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.6420739Z E ^ 2025-05-07T20:33:38.6421206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.6421662Z 2025-05-07T20:33:38.6422072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.6422583Z 2025-05-07T20:33:38.6422687Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.6423102Z self=, 2025-05-07T20:33:38.6423500Z T=4096, 2025-05-07T20:33:38.6423693Z D=5120, 2025-05-07T20:33:38.6423889Z scale_ub=1200.0, 2025-05-07T20:33:38.6424111Z contiguous=False, 2025-05-07T20:33:38.6424337Z compiled=True, 2025-05-07T20:33:38.6424543Z ) 2025-05-07T20:33:38.6424854Z self = 2025-05-07T20:33:38.6425346Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:38.6425622Z 2025-05-07T20:33:38.6425699Z @given( 2025-05-07T20:33:38.6425928Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.6426235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.6426540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.6426868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.6427189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.6427475Z ) 2025-05-07T20:33:38.6427823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.6428259Z def test_silu_mul_quant( 2025-05-07T20:33:38.6428499Z self, 2025-05-07T20:33:38.6428696Z T: int, 2025-05-07T20:33:38.6428890Z D: int, 2025-05-07T20:33:38.6429108Z scale_ub: Optional[float], 2025-05-07T20:33:38.6429378Z contiguous: bool, 2025-05-07T20:33:38.6429616Z compiled: bool, 2025-05-07T20:33:38.6429833Z ) -> None: 2025-05-07T20:33:38.6430049Z torch.manual_seed(2025) 2025-05-07T20:33:38.6430296Z 2025-05-07T20:33:38.6430565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.6430907Z 2025-05-07T20:33:38.6431100Z x_sign = torch.sign(x) 2025-05-07T20:33:38.6431384Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.6431694Z x = x_sign * x_clamp 2025-05-07T20:33:38.6431936Z x0 = x[:, :D] 2025-05-07T20:33:38.6432247Z x1 = x[:, D:] 2025-05-07T20:33:38.6432458Z 2025-05-07T20:33:38.6432654Z if contiguous: 2025-05-07T20:33:38.6432878Z x0 = x0.contiguous() 2025-05-07T20:33:38.6433136Z x1 = x1.contiguous() 2025-05-07T20:33:38.6433419Z 2025-05-07T20:33:38.6433606Z if scale_ub is not None: 2025-05-07T20:33:38.6433877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.6434210Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.6434512Z ) 2025-05-07T20:33:38.6434718Z else: 2025-05-07T20:33:38.6434928Z scale_ub_tensor = None 2025-05-07T20:33:38.6435177Z 2025-05-07T20:33:38.6435403Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.6435758Z op = silu_mul_quant 2025-05-07T20:33:38.6436008Z if compiled: 2025-05-07T20:33:38.6436248Z op = torch.compile(op) 2025-05-07T20:33:38.6436541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6436819Z 2025-05-07T20:33:38.6437060Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.6437232Z 2025-05-07T20:33:38.6437329Z moe/activation_test.py:117: 2025-05-07T20:33:38.6437623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6437960Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.6438275Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.6438837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:38.6439391Z return fn(*args, **kwargs) 
2025-05-07T20:33:38.6440038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.6440721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.6441257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.6441937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.6442593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.6443123Z kernel = self.compile( 2025-05-07T20:33:38.6443656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.6444302Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.6444691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.6444923Z 2025-05-07T20:33:38.6445129Z self = 2025-05-07T20:33:38.6446203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.6447574Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfaa5f2e0>} 2025-05-07T20:33:38.6448962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.6449982Z context = 2025-05-07T20:33:38.6450273Z 2025-05-07T20:33:38.6450438Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.6450954Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.6451418Z module_map=module_map) 2025-05-07T20:33:38.6451782Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.6452229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.6452484Z E ^ 2025-05-07T20:33:38.6452947Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.6453442Z 2025-05-07T20:33:38.6453857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.6454364Z 2025-05-07T20:33:38.7601698Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7602324Z self=, 2025-05-07T20:33:38.7602890Z T=2048, 2025-05-07T20:33:38.7603165Z D=7168, 2025-05-07T20:33:38.7603436Z scale_ub=1200.0, 2025-05-07T20:33:38.7603735Z contiguous=False, 2025-05-07T20:33:38.7604048Z compiled=False, 2025-05-07T20:33:38.7604320Z ) 2025-05-07T20:33:38.7604640Z self = 2025-05-07T20:33:38.7605146Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:38.7605426Z 2025-05-07T20:33:38.7605618Z @given( 2025-05-07T20:33:38.7605847Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7606167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7606478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7606817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7607145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7607438Z ) 2025-05-07T20:33:38.7607784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7608221Z def test_silu_mul_quant( 2025-05-07T20:33:38.7608462Z self, 2025-05-07T20:33:38.7608660Z T: int, 2025-05-07T20:33:38.7608852Z D: int, 2025-05-07T20:33:38.7609070Z scale_ub: Optional[float], 2025-05-07T20:33:38.7609338Z contiguous: bool, 2025-05-07T20:33:38.7609578Z compiled: bool, 2025-05-07T20:33:38.7609813Z ) -> None: 2025-05-07T20:33:38.7610064Z torch.manual_seed(2025) 2025-05-07T20:33:38.7610306Z 2025-05-07T20:33:38.7610577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7610918Z 2025-05-07T20:33:38.7611111Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7611399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7611708Z x = x_sign * x_clamp 2025-05-07T20:33:38.7611946Z x0 = x[:, :D] 2025-05-07T20:33:38.7612164Z x1 = x[:, D:] 2025-05-07T20:33:38.7612366Z 2025-05-07T20:33:38.7612554Z if contiguous: 2025-05-07T20:33:38.7612794Z x0 = x0.contiguous() 2025-05-07T20:33:38.7613059Z x1 = x1.contiguous() 2025-05-07T20:33:38.7613303Z 2025-05-07T20:33:38.7613501Z if scale_ub is not None: 2025-05-07T20:33:38.7613772Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7614115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7614426Z ) 2025-05-07T20:33:38.7614623Z else: 2025-05-07T20:33:38.7614832Z scale_ub_tensor = None 2025-05-07T20:33:38.7615084Z 2025-05-07T20:33:38.7615319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7615633Z op = silu_mul_quant 2025-05-07T20:33:38.7615885Z if compiled: 2025-05-07T20:33:38.7616132Z op = torch.compile(op) 2025-05-07T20:33:38.7616419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7616698Z 2025-05-07T20:33:38.7616892Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7617057Z 2025-05-07T20:33:38.7617168Z moe/activation_test.py:117: 2025-05-07T20:33:38.7617458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7617790Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7618076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7618839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:38.7619584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7620176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7620857Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7621522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7622054Z kernel = self.compile( 2025-05-07T20:33:38.7622590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7623237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7623635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7623874Z 2025-05-07T20:33:38.7624084Z self = 2025-05-07T20:33:38.7625208Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7626578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a42c0>} 2025-05-07T20:33:38.7627923Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7628952Z context = 2025-05-07T20:33:38.7629240Z 2025-05-07T20:33:38.7629411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7629937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7630404Z module_map=module_map) 2025-05-07T20:33:38.7630771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7631128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7631384Z E ^ 2025-05-07T20:33:38.7631852Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7632300Z 2025-05-07T20:33:38.7638279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7638835Z 2025-05-07T20:33:38.7638943Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7639353Z self=, 2025-05-07T20:33:38.7639747Z T=1, 2025-05-07T20:33:38.7639949Z D=7168, 2025-05-07T20:33:38.7640137Z scale_ub=None, 2025-05-07T20:33:38.7640348Z contiguous=True, 2025-05-07T20:33:38.7640575Z compiled=False, 2025-05-07T20:33:38.7640784Z ) 2025-05-07T20:33:38.7641095Z self = 2025-05-07T20:33:38.7641579Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:38.7641847Z 2025-05-07T20:33:38.7641928Z @given( 2025-05-07T20:33:38.7642163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:38.7642473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:38.7642780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:38.7643106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:38.7643428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:38.7643718Z ) 2025-05-07T20:33:38.7644056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:38.7644611Z def test_silu_mul_quant( 2025-05-07T20:33:38.7644852Z self, 2025-05-07T20:33:38.7645043Z T: int, 2025-05-07T20:33:38.7645239Z D: int, 2025-05-07T20:33:38.7645453Z scale_ub: Optional[float], 2025-05-07T20:33:38.7645790Z contiguous: bool, 2025-05-07T20:33:38.7646040Z compiled: bool, 2025-05-07T20:33:38.7646280Z ) -> None: 2025-05-07T20:33:38.7646503Z torch.manual_seed(2025) 2025-05-07T20:33:38.7646755Z 2025-05-07T20:33:38.7647050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:38.7647429Z 2025-05-07T20:33:38.7647626Z x_sign = torch.sign(x) 2025-05-07T20:33:38.7647955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:38.7648296Z x = x_sign * x_clamp 2025-05-07T20:33:38.7648544Z x0 = x[:, :D] 2025-05-07T20:33:38.7648770Z x1 = x[:, D:] 2025-05-07T20:33:38.7648985Z 2025-05-07T20:33:38.7649176Z if contiguous: 2025-05-07T20:33:38.7649431Z x0 = x0.contiguous() 2025-05-07T20:33:38.7649749Z x1 = x1.contiguous() 2025-05-07T20:33:38.7650000Z 2025-05-07T20:33:38.7650200Z if scale_ub is not None: 2025-05-07T20:33:38.7650495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:38.7650858Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:38.7651200Z ) 2025-05-07T20:33:38.7651401Z else: 2025-05-07T20:33:38.7651616Z scale_ub_tensor = None 2025-05-07T20:33:38.7651880Z 2025-05-07T20:33:38.7652123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:38.7652438Z op = silu_mul_quant 2025-05-07T20:33:38.7652680Z if compiled: 2025-05-07T20:33:38.7652919Z op = torch.compile(op) 2025-05-07T20:33:38.7653210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7653475Z 2025-05-07T20:33:38.7653669Z > y_fp8, y_scale = fn() 2025-05-07T20:33:38.7653837Z 2025-05-07T20:33:38.7653943Z moe/activation_test.py:117: 2025-05-07T20:33:38.7654233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7654567Z moe/activation_test.py:115: in fn 2025-05-07T20:33:38.7654844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:38.7655533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:38.7656215Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:38.7656752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:38.7657429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:38.7658086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:38.7658617Z kernel = self.compile( 2025-05-07T20:33:38.7659160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:38.7659817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:38.7660206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:38.7660442Z 2025-05-07T20:33:38.7660651Z self = 2025-05-07T20:33:38.7661735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:38.7663108Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a51c0>} 2025-05-07T20:33:38.7664496Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:38.7665823Z context = 2025-05-07T20:33:38.7666198Z 2025-05-07T20:33:38.7666361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:38.7666884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:38.7667345Z module_map=module_map) 2025-05-07T20:33:38.7667706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:38.7668062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:38.7668316Z E ^ 2025-05-07T20:33:38.7668772Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:38.7669227Z 2025-05-07T20:33:38.7669640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:38.7670147Z 2025-05-07T20:33:38.7670319Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:38.7670735Z self=, 2025-05-07T20:33:38.7671139Z T=16384, 2025-05-07T20:33:38.7671332Z D=7168, 2025-05-07T20:33:38.7671529Z scale_ub=1200.0, 2025-05-07T20:33:38.7671744Z contiguous=False, 2025-05-07T20:33:38.7671967Z compiled=True, 2025-05-07T20:33:39.0089743Z ) 2025-05-07T20:33:39.0090430Z self = 2025-05-07T20:33:39.0091199Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:39.0091599Z 2025-05-07T20:33:39.0091712Z @given( 2025-05-07T20:33:39.0092033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0092385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0092708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0093046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0093373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0093665Z ) 2025-05-07T20:33:39.0094019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0094468Z def test_silu_mul_quant( 2025-05-07T20:33:39.0094706Z self, 2025-05-07T20:33:39.0094905Z T: int, 2025-05-07T20:33:39.0095107Z D: int, 2025-05-07T20:33:39.0095327Z scale_ub: Optional[float], 2025-05-07T20:33:39.0095609Z contiguous: bool, 2025-05-07T20:33:39.0095856Z compiled: bool, 2025-05-07T20:33:39.0096077Z ) -> None: 2025-05-07T20:33:39.0096293Z torch.manual_seed(2025) 2025-05-07T20:33:39.0096544Z 2025-05-07T20:33:39.0096817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0097169Z 2025-05-07T20:33:39.0097365Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0097662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0097983Z x = x_sign * x_clamp 2025-05-07T20:33:39.0098229Z x0 = x[:, :D] 2025-05-07T20:33:39.0098442Z x1 = x[:, D:] 2025-05-07T20:33:39.0098653Z 2025-05-07T20:33:39.0098844Z if contiguous: 2025-05-07T20:33:39.0099076Z x0 = x0.contiguous() 2025-05-07T20:33:39.0099346Z x1 = x1.contiguous() 2025-05-07T20:33:39.0099587Z 2025-05-07T20:33:39.0099786Z if scale_ub is not None: 2025-05-07T20:33:39.0100052Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0100388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0100703Z ) 2025-05-07T20:33:39.0100894Z else: 2025-05-07T20:33:39.0101111Z scale_ub_tensor = None 2025-05-07T20:33:39.0101367Z 2025-05-07T20:33:39.0101596Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0102102Z op = silu_mul_quant 2025-05-07T20:33:39.0102356Z if compiled: 2025-05-07T20:33:39.0102605Z op = torch.compile(op) 2025-05-07T20:33:39.0102904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0103280Z 2025-05-07T20:33:39.0103472Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0103641Z 2025-05-07T20:33:39.0103744Z moe/activation_test.py:117: 2025-05-07T20:33:39.0104041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0104376Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0104654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0105210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:39.0105769Z return fn(*args, **kwargs) 
2025-05-07T20:33:39.0106419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0107108Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0107705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0108412Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0109099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0109633Z kernel = self.compile( 2025-05-07T20:33:39.0110173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0110824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0111232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0111464Z 2025-05-07T20:33:39.0111672Z self = 2025-05-07T20:33:39.0112764Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0114143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a65c0>} 2025-05-07T20:33:39.0115483Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0116613Z context = 2025-05-07T20:33:39.0116908Z 2025-05-07T20:33:39.0117073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0117602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0118075Z module_map=module_map) 2025-05-07T20:33:39.0118456Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0118850Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0119110Z E ^ 2025-05-07T20:33:39.0119576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:39.0120035Z 2025-05-07T20:33:39.0120453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:39.0120962Z 2025-05-07T20:33:39.0121070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:39.0121481Z self=, 2025-05-07T20:33:39.0121885Z T=1, 2025-05-07T20:33:39.0122076Z D=7168, 2025-05-07T20:33:39.0122269Z scale_ub=None, 2025-05-07T20:33:39.0122489Z contiguous=False, 2025-05-07T20:33:39.0122839Z compiled=False, 2025-05-07T20:33:39.0123044Z ) 2025-05-07T20:33:39.0123364Z self = 2025-05-07T20:33:39.0123860Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:39.0124167Z 2025-05-07T20:33:39.0124250Z @given( 2025-05-07T20:33:39.0124480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:39.0124794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:39.0125106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:39.0125433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:39.0125762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:39.0126048Z ) 2025-05-07T20:33:39.0126390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:39.0126833Z def test_silu_mul_quant( 2025-05-07T20:33:39.0127079Z self, 2025-05-07T20:33:39.0127273Z T: int, 2025-05-07T20:33:39.0127480Z D: int, 2025-05-07T20:33:39.0127743Z scale_ub: Optional[float], 2025-05-07T20:33:39.0128013Z contiguous: bool, 2025-05-07T20:33:39.0128261Z compiled: bool, 2025-05-07T20:33:39.0128489Z ) -> None: 2025-05-07T20:33:39.0128702Z torch.manual_seed(2025) 2025-05-07T20:33:39.0128947Z 2025-05-07T20:33:39.0129225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:39.0129572Z 2025-05-07T20:33:39.0129767Z x_sign = torch.sign(x) 2025-05-07T20:33:39.0130063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:39.0130370Z x = x_sign * x_clamp 2025-05-07T20:33:39.0130614Z x0 = x[:, :D] 2025-05-07T20:33:39.0130834Z x1 = x[:, D:] 2025-05-07T20:33:39.0131047Z 2025-05-07T20:33:39.0131233Z if contiguous: 2025-05-07T20:33:39.0131465Z x0 = x0.contiguous() 2025-05-07T20:33:39.0131725Z x1 = x1.contiguous() 2025-05-07T20:33:39.0131969Z 2025-05-07T20:33:39.0132167Z if scale_ub is not None: 2025-05-07T20:33:39.0132450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:39.0132788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:39.0133105Z ) 2025-05-07T20:33:39.0133297Z else: 2025-05-07T20:33:39.0133506Z scale_ub_tensor = None 2025-05-07T20:33:39.0133757Z 2025-05-07T20:33:39.0133994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:39.0134310Z op = silu_mul_quant 2025-05-07T20:33:39.0134556Z if compiled: 2025-05-07T20:33:39.0134810Z op = torch.compile(op) 2025-05-07T20:33:39.0135104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0135376Z 2025-05-07T20:33:39.0135578Z > y_fp8, y_scale = fn() 2025-05-07T20:33:39.0135742Z 2025-05-07T20:33:39.0135848Z moe/activation_test.py:117: 2025-05-07T20:33:39.0136144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0136481Z moe/activation_test.py:115: in fn 2025-05-07T20:33:39.0136764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:39.0137445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:39.0138136Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:39.0138727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:39.0139409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:39.0140083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:39.0140614Z kernel = self.compile( 2025-05-07T20:33:39.0141153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:39.0141895Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:39.0142297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:39.0142535Z 2025-05-07T20:33:39.0142790Z self = 2025-05-07T20:33:39.0143867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:39.0145240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa9a71a0>} 2025-05-07T20:33:39.0146581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:39.0147652Z context = 2025-05-07T20:33:39.0147950Z 2025-05-07T20:33:39.0148118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:39.0148649Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:39.0149117Z module_map=module_map) 2025-05-07T20:33:39.0149482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:39.0149844Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:39.0150107Z E ^ 2025-05-07T20:33:39.0150569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:39.0151025Z 2025-05-07T20:33:39.0151440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:39.0151947Z
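Root cause of the failures above: the _fbgemm_silu_mul_quant Triton kernel casts its output to fp8e4nv (torch.float8_e4m3fn), and, as the error message itself states, this GPU only exposes fp8e4b15 and fp8e5. Triton lowers fp8e4nv only on compute capability 8.9 or newer (Ada/Hopper), while the A10G behind this linux.g5.4xlarge.nvidia.gpu runner is sm_86. A minimal guard sketch follows; supports_fp8e4nv is a hypothetical helper and this is not FBGEMM's actual test code, but it illustrates how such tests are commonly skipped on unsupported hardware:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) is only lowered by Triton on compute
    # capability >= 8.9 (e.g. L4, L40S, H100); the A10G here is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical decorator; FBGEMM may gate this differently.
requires_fp8e4nv = unittest.skipIf(
    not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
)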
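For reference, the op under test fuses a SiLU-gated multiply with row-wise fp8 quantization. The sketch below is an eager-mode approximation inferred from the test body; silu_mul_quant_ref, the 448.0 fp8e4m3 max, and the reading of scale_ub as a cap on the per-row amax are all assumptions, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, then quantize each row to fp8e4m3 with a
    # per-row scale; scale_ub (if given) caps the amax used for scaling.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(amax.dtype))
    y_scale = amax / FP8_MAX
    y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale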
Hypothesis went on to retry the remaining parameter combinations; every example below failed identically, with the same test source listing and the same traceback through _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80:
2025-05-07T20:33:39.0152059Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.1060858Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.2692812Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:39.2724726Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:39.4479772Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.4512663Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:39.5457328Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.7239141Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:39.7272133Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:39.9989384Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:40.1257020Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:40.1292846Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:40.1293208Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:40.1293513Z E   ^
2025-05-07T20:33:40.1293975Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:40.1294860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:40.1295481Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:40.1325826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") (test source and traceback identical to the failure above)
2025-05-07T20:33:40.3061715Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.3097605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure (traceback additionally passes through torch/_dynamo/eval_frame.py:678 because compiled=True)
2025-05-07T20:33:40.3098239Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:40.4039022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
2025-05-07T20:33:40.4039639Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:40.4078958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
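Note: the CompilationError entries in this run are all the same architecture mismatch, not a bug that depends on T, D, scale_ub, contiguous, or compiled. fp8e4nv is Triton's name for the float8_e4m3fn format, which Triton lowers only on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on the pre-8.9 Ampere-class GPU in this job, Triton offers only fp8e4b15 and fp8e5, so _fbgemm_silu_mul_quant fails inside make_ir before the kernel ever runs. Below is a minimal sketch of a capability guard that would skip such examples on unsupported GPUs; the helper and class names are illustrative, not the ones used in moe/activation_test.py:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn; Triton compiles it
        # only for compute capability >= 8.9 (e.g. L4, L40S, H100).  A GPU
        # whose supported fp8 set is ('fp8e4b15', 'fp8e5'), as in this log,
        # is below that threshold.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires an SM 8.9+ GPU")
    class SiluMulQuantFp8Test(unittest.TestCase):
        def test_placeholder(self) -> None:
            pass  # the real property test lives in moe/activation_test.py

    if __name__ == "__main__":
        unittest.main()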
2025-05-07T20:33:40.4745616Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:40.4757114Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:40.4759133Z moe/activation_test.py:95: OutOfMemoryError (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
2025-05-07T20:33:40.4759463Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:40.4770901Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:33:40.4772960Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:40.4773285Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:40.4783596Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:33:40.4785679Z moe/activation_test.py:92: OutOfMemoryError (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
2025-05-07T20:33:40.4785998Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:40.4797121Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
2025-05-07T20:33:40.4799113Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:40.4799482Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.5950550Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
2025-05-07T20:33:40.5952600Z moe/activation_test.py:94: OutOfMemoryError (x_sign = torch.sign(x))
2025-05-07T20:33:40.5952917Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:40.5983799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") (same traceback as above)
2025-05-07T20:33:40.5984496Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.6693601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
2025-05-07T20:33:40.6694223Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:40.6730733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError -- same failure
2025-05-07T20:33:40.6731356Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:40.7549145Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated)
2025-05-07T20:33:40.7551187Z moe/activation_test.py:92: OutOfMemoryError
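Note: the OutOfMemoryError sizes match the test's own tensors exactly. x = torch.randn([T, 2 * D], dtype=torch.bfloat16) takes T * 2D * 2 bytes: 448.00 MiB for T=16384, D=7168; 320.00 MiB for T=16384, D=5120; 56.00 MiB for T=2048, D=7168; and each of torch.abs, torch.clamp, and torch.sign materializes another tensor of the same size. With roughly 22 GiB on the device, memory is nearly exhausted after the earlier failing examples, likely because the stored failure tracebacks keep frame references to x, x0, and x1 alive across hypothesis examples, so later examples fail on allocations as small as 40 MiB. The sketch below checks that arithmetic and shows one possible cleanup between examples; the helper names are illustrative, not from the test file, and the allocator hint quoted in the message itself, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only takes effect if set before the first CUDA allocation:

    import gc
    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def activation_input_bytes(T: int, D: int) -> int:
        # One [T, 2*D] bfloat16 tensor at 2 bytes per element.
        return T * 2 * D * 2

    assert activation_input_bytes(16384, 7168) == 448 * 1024**2  # 448.00 MiB
    assert activation_input_bytes(16384, 5120) == 320 * 1024**2  # 320.00 MiB
    assert activation_input_bytes(2048, 7168) == 56 * 1024**2    # 56.00 MiB

    def release_cuda_memory() -> None:
        # Collect unreachable tensors, then return cached but unused blocks
        # to the driver.  This cannot reclaim tensors that are still pinned
        # by live traceback frames, only ones with no remaining references.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()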
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:40.7551063Z 2025-05-07T20:33:40.7551187Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:40.7551409Z 2025-05-07T20:33:40.7551516Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:40.7551935Z self=, 2025-05-07T20:33:40.7552343Z T=1, 2025-05-07T20:33:40.7552540Z D=5120, 2025-05-07T20:33:40.7552747Z scale_ub=1200.0, 2025-05-07T20:33:40.7552979Z contiguous=True, 2025-05-07T20:33:40.7553200Z compiled=False, 2025-05-07T20:33:40.7553419Z ) 2025-05-07T20:33:40.7553744Z self = 2025-05-07T20:33:40.7554237Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:40.7554511Z 2025-05-07T20:33:40.7554597Z @given( 2025-05-07T20:33:40.7554836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:40.7555218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:40.7555538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:40.7555968Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:40.7556301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:40.7556595Z ) 2025-05-07T20:33:40.7556945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:40.7557390Z def test_silu_mul_quant( 2025-05-07T20:33:40.7557632Z self, 2025-05-07T20:33:40.7557827Z T: int, 2025-05-07T20:33:40.7558030Z D: int, 2025-05-07T20:33:40.7558254Z scale_ub: Optional[float], 2025-05-07T20:33:40.7558529Z contiguous: bool, 2025-05-07T20:33:40.7558770Z compiled: bool, 2025-05-07T20:33:40.7558990Z ) -> None: 2025-05-07T20:33:40.7559217Z torch.manual_seed(2025) 2025-05-07T20:33:40.7559501Z 2025-05-07T20:33:40.7559842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:40.7560196Z 2025-05-07T20:33:40.7560400Z x_sign = torch.sign(x) 2025-05-07T20:33:40.7560689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:40.7561046Z x = x_sign * x_clamp 2025-05-07T20:33:40.7561291Z x0 = x[:, :D] 2025-05-07T20:33:40.7561504Z x1 = x[:, D:] 2025-05-07T20:33:40.7561714Z 2025-05-07T20:33:40.7561907Z if contiguous: 2025-05-07T20:33:40.7562135Z x0 = x0.contiguous() 2025-05-07T20:33:40.7562397Z x1 = x1.contiguous() 2025-05-07T20:33:40.7562641Z 2025-05-07T20:33:40.7562840Z if scale_ub is not None: 2025-05-07T20:33:40.7563114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:40.7563453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:40.7563772Z ) 2025-05-07T20:33:40.7563963Z else: 2025-05-07T20:33:40.7564174Z scale_ub_tensor = None 2025-05-07T20:33:40.7564431Z 2025-05-07T20:33:40.7564710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:40.7565030Z op = silu_mul_quant 2025-05-07T20:33:40.7565302Z if compiled: 2025-05-07T20:33:40.7565801Z op = torch.compile(op) 2025-05-07T20:33:40.7566099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:40.7566376Z 2025-05-07T20:33:40.7566579Z > y_fp8, y_scale = fn() 2025-05-07T20:33:40.7566746Z 2025-05-07T20:33:40.7566850Z moe/activation_test.py:117: 2025-05-07T20:33:40.7567148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:40.7567479Z moe/activation_test.py:115: in fn 2025-05-07T20:33:40.7567771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:40.7568457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:40.7569148Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa169b20>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
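The ValueError above is what recent Triton versions raise when a kernel uses the fp8e4nv (FP8 E4M3) dtype on a GPU whose compute capability is below 8.9; this device only reports support for the fp8e4b15 and fp8e5 encodings. A minimal, illustrative guard (the helper name supports_fp8e4nv is ours, not from the test file) could skip these cases on unsupported hardware instead of failing the suite:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton maps fp8e4nv onto native FP8 E4M3, which first appears at
        # compute capability 8.9 (Ada) / 9.0 (Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationFp8Tests(unittest.TestCase):
        ...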
Eleven of the generated examples then failed the same way with torch.OutOfMemoryError before ever reaching the kernel. The test body is identical to the listing above, so only the parameters, the failing statement, and the attempted allocation are summarized here. In each case GPU 0 (22.07 GiB total) had only 26.44 MiB free, with roughly 21.7 GiB already allocated by PyTorch, and each error ended with the standard advice to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and to see the Memory Management documentation (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables).

    T      D     scale_ub  contiguous  compiled  failing statement            tried to allocate
    2048   5120  None      True        False     torch.sign(x)     (line 94)   40.00 MiB
    16384  5120  None      True        False     torch.randn(...)  (line 92)  320.00 MiB
    4096   5120  None      True        False     torch.randn(...)  (line 92)   80.00 MiB
    2048   5120  None      False       False     torch.randn(...)  (line 92)   40.00 MiB
    4096   7168  None      True        True      torch.randn(...)  (line 92)  112.00 MiB
    2048   5120  1200.0    False       False     torch.randn(...)  (line 92)   40.00 MiB
    4096   7168  1200.0    True        False     torch.randn(...)  (line 92)  112.00 MiB
    16384  7168  None      False       True      torch.randn(...)  (line 92)  448.00 MiB
    4096   7168  None      True        False     torch.randn(...)  (line 92)  112.00 MiB
    16384  7168  None      True        False     torch.randn(...)  (line 92)  448.00 MiB
    16384  7168  1200.0    True        False     torch.randn(...)  (line 92)  448.00 MiB

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

(test body identical to the listing above; this example allocated successfully and reached the kernel launch)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.0887680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.0888364Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.0889033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.0889576Z kernel = self.compile( 2025-05-07T20:33:41.0890126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.0890787Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.0891241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.0891473Z 2025-05-07T20:33:41.0891692Z self = 2025-05-07T20:33:41.0892788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.0894167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b4860>} 2025-05-07T20:33:41.0895525Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.0896562Z context = 2025-05-07T20:33:41.0896857Z 2025-05-07T20:33:41.0897036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.0897564Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.0898043Z module_map=module_map) 2025-05-07T20:33:41.0898409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.0898766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.0899026Z E ^ 2025-05-07T20:33:41.0899513Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.0899974Z 2025-05-07T20:33:41.0900394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.0900909Z 2025-05-07T20:33:41.0901023Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.0901440Z self=, 2025-05-07T20:33:41.0901923Z T=2048, 2025-05-07T20:33:41.0902119Z D=7168, 2025-05-07T20:33:41.0902327Z scale_ub=None, 2025-05-07T20:33:41.0902551Z contiguous=False, 2025-05-07T20:33:41.0902785Z compiled=False, 2025-05-07T20:33:41.0902995Z ) 2025-05-07T20:33:41.0903313Z self = 2025-05-07T20:33:41.0903834Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:41.0904109Z 2025-05-07T20:33:41.0904199Z @given( 2025-05-07T20:33:41.0904445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.0904758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.0905076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.0905411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.0905792Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.0906093Z ) 2025-05-07T20:33:41.0906449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.0906887Z def test_silu_mul_quant( 2025-05-07T20:33:41.0907180Z self, 2025-05-07T20:33:41.0907380Z T: int, 2025-05-07T20:33:41.0907576Z D: int, 2025-05-07T20:33:41.0907796Z scale_ub: Optional[float], 2025-05-07T20:33:41.0908070Z contiguous: bool, 2025-05-07T20:33:41.0908307Z compiled: bool, 2025-05-07T20:33:41.0908536Z ) -> None: 2025-05-07T20:33:41.0908760Z torch.manual_seed(2025) 2025-05-07T20:33:41.0909007Z 2025-05-07T20:33:41.0909280Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.0911438Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.0913315Z 2025-05-07T20:33:41.0913439Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.0913654Z 2025-05-07T20:33:41.0913768Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.0914182Z self=, 2025-05-07T20:33:41.0914595Z T=128, 2025-05-07T20:33:41.0914790Z D=7168, 2025-05-07T20:33:41.0914986Z scale_ub=1200.0, 2025-05-07T20:33:41.0915207Z contiguous=True, 2025-05-07T20:33:41.0915431Z compiled=True, 2025-05-07T20:33:41.0915641Z ) 2025-05-07T20:33:41.1227949Z self = 2025-05-07T20:33:41.1229038Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.1229737Z 2025-05-07T20:33:41.1229860Z @given( 2025-05-07T20:33:41.1230111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.1230442Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.1230757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.1231101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.1231444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.1231738Z ) 2025-05-07T20:33:41.1232098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.1232553Z def test_silu_mul_quant( 2025-05-07T20:33:41.1232802Z self, 2025-05-07T20:33:41.1232996Z T: int, 2025-05-07T20:33:41.1233204Z D: int, 2025-05-07T20:33:41.1233427Z scale_ub: Optional[float], 2025-05-07T20:33:41.1233700Z contiguous: bool, 2025-05-07T20:33:41.1233950Z compiled: bool, 2025-05-07T20:33:41.1234291Z ) -> None: 2025-05-07T20:33:41.1234508Z torch.manual_seed(2025) 2025-05-07T20:33:41.1234764Z 2025-05-07T20:33:41.1235048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.1235394Z 2025-05-07T20:33:41.1235598Z x_sign = torch.sign(x) 2025-05-07T20:33:41.1235964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.1236276Z x = x_sign * x_clamp 2025-05-07T20:33:41.1236520Z x0 = x[:, :D] 2025-05-07T20:33:41.1236740Z x1 = x[:, D:] 2025-05-07T20:33:41.1236951Z 2025-05-07T20:33:41.1237141Z if contiguous: 2025-05-07T20:33:41.1237376Z x0 = x0.contiguous() 2025-05-07T20:33:41.1237633Z x1 = x1.contiguous() 2025-05-07T20:33:41.1237880Z 2025-05-07T20:33:41.1238078Z if scale_ub is not None: 2025-05-07T20:33:41.1238349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.1238768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.1239092Z ) 2025-05-07T20:33:41.1239296Z else: 2025-05-07T20:33:41.1239504Z scale_ub_tensor = None 2025-05-07T20:33:41.1239827Z 2025-05-07T20:33:41.1240066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.1240382Z op = silu_mul_quant 2025-05-07T20:33:41.1240635Z if compiled: 2025-05-07T20:33:41.1240888Z op = torch.compile(op) 2025-05-07T20:33:41.1241183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.1241462Z 2025-05-07T20:33:41.1241660Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.1241824Z 2025-05-07T20:33:41.1241925Z moe/activation_test.py:117: 2025-05-07T20:33:41.1242222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.1242558Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.1242843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.1243464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.1244035Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.1244701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.1245388Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.1245936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.1246618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.1247289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.1247819Z kernel = self.compile( 2025-05-07T20:33:41.1248365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.1249028Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.1249428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.1249670Z 2025-05-07T20:33:41.1249879Z self = 2025-05-07T20:33:41.1250970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.1252349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b59e0>} 2025-05-07T20:33:41.1253699Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.1254777Z context = 2025-05-07T20:33:41.1255073Z 2025-05-07T20:33:41.1255242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.1255775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.1256254Z module_map=module_map) 2025-05-07T20:33:41.1256624Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.1256996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.1257262Z E ^ 2025-05-07T20:33:41.1257723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test body identical to the listing above)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
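A quick cross-check of the failed allocation sizes against the test's input shape: torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes, which reproduces the large requests in the table above exactly. The sketch below is ours, not part of the test suite:

    # bf16 input tensor: T x 2D elements, 2 bytes each
    for T, D in [(16384, 7168), (4096, 7168), (16384, 5120), (2048, 5120)]:
        print(T, D, T * 2 * D * 2 / 2**20, "MiB")
    # -> 448.0, 112.0, 320.0, 40.0 MiB, matching the log

For the T=128 examples the tensor itself is only about 3.5 MiB; the reported 20.00 MiB request is plausibly the CUDA caching allocator's minimum large-pool segment size rather than the tensor size.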
Two further examples failed with the same torch.OutOfMemoryError (test body identical to the listing above), by now with only 4.44 MiB of GPU memory free and 21.77 GiB held by PyTorch:

    T    D     scale_ub  contiguous  compiled  failing statement            tried to allocate
    128  5120  1200.0    True        True      torch.clamp(...)  (line 95)  20.00 MiB
    128  7168  None      True        True      torch.randn(...)  (line 92)  20.00 MiB
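Every one of these errors ends with the same hint. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so to try expandable_segments it must be set before the process first touches the GPU, e.g. in the job environment (export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) or, as a sketch, at the very top of the test entry point:

    import os

    # Must run before the first CUDA allocation; it has no effect afterwards.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the allocator config is in place

Note this only mitigates fragmentation; with roughly 21.7 GiB already held by earlier examples in the same process, releasing memory between examples is likely to matter more.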
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.3863394Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3863994Z | self=, 2025-05-07T20:33:41.3864559Z | T=2048, 2025-05-07T20:33:41.3864878Z | D=5120, # or any other generated value 2025-05-07T20:33:41.3865553Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:41.3866165Z | contiguous=True, # or any other generated value 2025-05-07T20:33:41.3866725Z | compiled=False, # or any other generated value 2025-05-07T20:33:41.3867159Z | ) 2025-05-07T20:33:41.3867414Z | 2025-05-07T20:33:41.3868166Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:41.3868993Z +---------------- 2 ---------------- 2025-05-07T20:33:41.3869388Z | Traceback (most recent call last): 2025-05-07T20:33:41.3870353Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:41.3871430Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3872065Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3874787Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.3877641Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3878244Z | self=, 2025-05-07T20:33:41.3878801Z | T=128, 2025-05-07T20:33:41.3879103Z | D=7168, 2025-05-07T20:33:41.3879392Z | scale_ub=None, 2025-05-07T20:33:41.3879722Z | contiguous=True, 2025-05-07T20:33:41.3880054Z | compiled=True, 2025-05-07T20:33:41.3880370Z | ) 2025-05-07T20:33:41.3880711Z | 2025-05-07T20:33:41.3881426Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:41.3882283Z +---------------- 3 ---------------- 2025-05-07T20:33:41.3882690Z | Traceback (most recent call last): 2025-05-07T20:33:41.3883656Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:41.3884702Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3885220Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3887433Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.3889409Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3889843Z | self=, 2025-05-07T20:33:41.3890250Z | T=128, 2025-05-07T20:33:41.3890462Z | D=5120, 2025-05-07T20:33:41.3890678Z | scale_ub=1200.0, 2025-05-07T20:33:41.3890925Z | contiguous=True, 2025-05-07T20:33:41.3891162Z | compiled=True, 2025-05-07T20:33:41.3891390Z | ) 2025-05-07T20:33:41.3891575Z | 2025-05-07T20:33:41.3892098Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:41.3892767Z +---------------- 4 ---------------- 2025-05-07T20:33:41.3893061Z | Traceback (most recent call last): 2025-05-07T20:33:41.3893766Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:41.3894482Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:41.3894772Z | ^^^^^^^^ 2025-05-07T20:33:41.3895413Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:41.3896106Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3896446Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3897315Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:41.3898118Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.3898721Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:41.3899504Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.3899953Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3900700Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:41.3901808Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.3902478Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3903457Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:41.3904468Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.3905024Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3905901Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:41.3912432Z | fn() 2025-05-07T20:33:41.3913269Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:41.3914144Z | self.fn.run( 2025-05-07T20:33:41.3914880Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:41.3915758Z | kernel = self.compile( 2025-05-07T20:33:41.3916136Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:41.3916968Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:41.3917935Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.3918478Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3919369Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:41.3920480Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.3921149Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:41.3921669Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.3922162Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:41.3922536Z | ^ 2025-05-07T20:33:41.3923169Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.3924067Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:41.3924626Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:41.3925348Z | self=, 2025-05-07T20:33:41.3925951Z | T=1, # or any other generated value 2025-05-07T20:33:41.3926394Z | D=5120, # or any other generated value 2025-05-07T20:33:41.3926870Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:41.3927370Z | contiguous=True, # or any other generated value 2025-05-07T20:33:41.3927883Z | compiled=True, # or any other generated value 2025-05-07T20:33:41.3928310Z | ) 2025-05-07T20:33:41.3928554Z | 2025-05-07T20:33:41.3929350Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:41.3930232Z +------------------------------------ 2025-05-07T20:33:41.3930754Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:41.3931381Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.3931945Z self=, 2025-05-07T20:33:41.3932490Z T=1, 2025-05-07T20:33:41.3932757Z D=5120, 2025-05-07T20:33:41.3933032Z scale_ub=None, 2025-05-07T20:33:41.3933336Z contiguous=True, 2025-05-07T20:33:41.3933639Z compiled=True, 2025-05-07T20:33:41.3933935Z ) 2025-05-07T20:33:41.3934377Z self = 2025-05-07T20:33:41.3935027Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.3935386Z 2025-05-07T20:33:41.3935500Z @given( 2025-05-07T20:33:41.3935823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.3936265Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.3936752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.3937215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.3937673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.3938067Z ) 2025-05-07T20:33:41.3938551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.3939163Z def test_silu_mul_quant( 2025-05-07T20:33:41.3939517Z self, 2025-05-07T20:33:41.3939826Z T: int, 2025-05-07T20:33:41.3940107Z D: int, 2025-05-07T20:33:41.3940408Z scale_ub: Optional[float], 2025-05-07T20:33:41.3940793Z contiguous: bool, 2025-05-07T20:33:41.3941136Z compiled: bool, 2025-05-07T20:33:41.3941448Z ) -> None: 2025-05-07T20:33:41.3941750Z torch.manual_seed(2025) 2025-05-07T20:33:41.3942086Z 2025-05-07T20:33:41.3942469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3942935Z 2025-05-07T20:33:41.3943214Z x_sign = torch.sign(x) 2025-05-07T20:33:41.3943615Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.3944049Z x = x_sign * x_clamp 2025-05-07T20:33:41.3944396Z x0 = x[:, :D] 2025-05-07T20:33:41.3944700Z x1 = x[:, D:] 2025-05-07T20:33:41.3944983Z 2025-05-07T20:33:41.3945245Z if contiguous: 2025-05-07T20:33:41.3945567Z x0 = x0.contiguous() 2025-05-07T20:33:41.3945925Z x1 = x1.contiguous() 2025-05-07T20:33:41.3946270Z 2025-05-07T20:33:41.3946540Z if scale_ub is not None: 2025-05-07T20:33:41.3946904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.3947365Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.3947798Z ) 2025-05-07T20:33:41.3948069Z else: 2025-05-07T20:33:41.3948374Z scale_ub_tensor = None 2025-05-07T20:33:41.3948735Z 2025-05-07T20:33:41.3949060Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3949575Z op = silu_mul_quant 2025-05-07T20:33:41.3949972Z if compiled: 2025-05-07T20:33:41.3950327Z op = torch.compile(op) 2025-05-07T20:33:41.3950737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.3951129Z 2025-05-07T20:33:41.3951405Z y_fp8, y_scale = fn() 2025-05-07T20:33:41.3951788Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:41.3952193Z 2025-05-07T20:33:41.3952523Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3952982Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:41.3953378Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:41.3953801Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:41.3954272Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3954687Z 2025-05-07T20:33:41.3955015Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:41.3955278Z 2025-05-07T20:33:41.3955428Z moe/activation_test.py:126: 2025-05-07T20:33:41.3955945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.3956467Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:41.3956909Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3957977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:41.3959025Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.3959844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.3960766Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.3961687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:41.3962724Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.3963737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:41.3964607Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.3965709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:41.3966418Z fn() 2025-05-07T20:33:41.3967100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:41.3967880Z self.fn.run( 2025-05-07T20:33:41.3968507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.3969225Z kernel = self.compile( 2025-05-07T20:33:41.3969952Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:41.3970825Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:41.3971366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:41.3971694Z 
2025-05-07T20:33:41.3971984Z self = 
2025-05-07T20:33:41.3973436Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:41.3975325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb2735c60>}
2025-05-07T20:33:41.3977219Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:41.3978793Z context = 
2025-05-07T20:33:41.3979184Z 
2025-05-07T20:33:41.3979421Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:41.3980173Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:41.3980821Z module_map=module_map)
2025-05-07T20:33:41.3981314Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.3981801Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:41.3982160Z E ^
2025-05-07T20:33:41.3982791Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.3983393Z 
2025-05-07T20:33:41.3984040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.3984722Z 
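Two follow-ups suggested by this failure. First, the "You can reproduce this example by temporarily adding @reproduce_failure(...)" lines above can be pasted onto the test to replay a single falsifying example deterministically; a minimal sketch, with the payload copied verbatim from the log and the strategies mirrored from the test source (the standalone function name and elided body are illustrative, not the actual activation_test.py contents):

    # A minimal sketch: replay one falsifying example from this log with Hypothesis.
    from typing import Optional
    from hypothesis import Verbosity, given, reproduce_failure, settings, strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # payload from the log; remove once fixed
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant_repro(T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool) -> None:
        ...  # same body as test_silu_mul_quant in moe/activation_test.py

Second, the root ValueError is Triton rejecting the fp8e4nv element type (PyTorch's float8_e4m3fn), which Triton typically only lowers on GPUs with compute capability 8.9 (Ada) or 9.0 (Hopper) and newer; on older parts only fp8e5 and fp8e4b15 are available, exactly as the message lists. A capability guard of the following shape would let the suite skip rather than error on such runners; this is a sketch only, and the helper name and the >= (8, 9) threshold are assumptions, not FBGEMM API:

    # A minimal sketch: skip fp8e4nv-dependent tests on GPUs where Triton cannot compile that dtype.
    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # Assumption: fp8e4nv needs compute capability >= 8.9 (Ada) in this Triton version.
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU"
    )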
2025-05-07T20:33:41.3984870Z The remaining Hypothesis examples all fail with this same root cause, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"): examples with compiled=False fail inside fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant, and examples with compiled=True fail inside ref_fn() (moe/activation_test.py:126) while compiling _kernel_quantize_fp8_row. The surrounding make_ir()/CUDAOptions frames are identical to the one shown above apart from the autotuner's num_stages (3 on the fn() path, 2 on the ref_fn() path) and object addresses.
2025-05-07T20:33:41.3985436Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4048448Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError in ref_fn() at _kernel_quantize_fp8_row
2025-05-07T20:33:41.4092753Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4124242Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in ref_fn() at _kernel_quantize_fp8_row
2025-05-07T20:33:41.4140637Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4153709Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant
2025-05-07T20:33:41.4167314Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in ref_fn() at _kernel_quantize_fp8_row
2025-05-07T20:33:41.4190753Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in fn() at _fbgemm_silu_mul_quant:
2025-05-07T20:33:41.4203315Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4203423Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:41.4203502Z E ^
2025-05-07T20:33:41.4203866Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4203924Z 2025-05-07T20:33:41.4204339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4204346Z 2025-05-07T20:33:41.4204452Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4204682Z self=, 2025-05-07T20:33:41.4204762Z T=4096, 2025-05-07T20:33:41.4204845Z D=5120, 2025-05-07T20:33:41.4204937Z scale_ub=1200.0, 2025-05-07T20:33:41.4205023Z contiguous=True, 2025-05-07T20:33:41.4205110Z compiled=False, 2025-05-07T20:33:41.4205191Z ) 2025-05-07T20:33:41.4205412Z self = 2025-05-07T20:33:41.4205597Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4205601Z 2025-05-07T20:33:41.4205685Z @given( 2025-05-07T20:33:41.4205852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4205965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4206086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4206246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4206368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4206444Z ) 2025-05-07T20:33:41.4206691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4206795Z def test_silu_mul_quant( 2025-05-07T20:33:41.4206875Z self, 2025-05-07T20:33:41.4206961Z T: int, 2025-05-07T20:33:41.4207039Z D: int, 2025-05-07T20:33:41.4207140Z scale_ub: Optional[float], 2025-05-07T20:33:41.4207239Z contiguous: bool, 2025-05-07T20:33:41.4207327Z compiled: bool, 2025-05-07T20:33:41.4207408Z ) -> None: 2025-05-07T20:33:41.4207508Z torch.manual_seed(2025) 2025-05-07T20:33:41.4207586Z 2025-05-07T20:33:41.4207803Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4207889Z 2025-05-07T20:33:41.4207984Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4208113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4208209Z x = x_sign * x_clamp 2025-05-07T20:33:41.4208292Z x0 = x[:, :D] 2025-05-07T20:33:41.4208378Z x1 = x[:, D:] 2025-05-07T20:33:41.4208452Z 2025-05-07T20:33:41.4208538Z if contiguous: 2025-05-07T20:33:41.4208636Z x0 = x0.contiguous() 2025-05-07T20:33:41.4208726Z x1 = x1.contiguous() 2025-05-07T20:33:41.4208801Z 2025-05-07T20:33:41.4208897Z if scale_ub is not None: 2025-05-07T20:33:41.4209005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4209143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4209227Z ) 2025-05-07T20:33:41.4209306Z else: 2025-05-07T20:33:41.4209410Z scale_ub_tensor = None 2025-05-07T20:33:41.4209496Z 2025-05-07T20:33:41.4209657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4209761Z op = silu_mul_quant 2025-05-07T20:33:41.4209867Z if compiled: 2025-05-07T20:33:41.4209970Z op = torch.compile(op) 2025-05-07T20:33:41.4210087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4210164Z 2025-05-07T20:33:41.4210259Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4210264Z 2025-05-07T20:33:41.4210371Z moe/activation_test.py:117: 2025-05-07T20:33:41.4210502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4210605Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4210715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4211214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4211322Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4211731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4211958Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4212300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4212399Z kernel = self.compile( 2025-05-07T20:33:41.4212781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4212962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4213093Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4213097Z 2025-05-07T20:33:41.4213309Z self = 2025-05-07T20:33:41.4214129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4214675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb1b20>} 2025-05-07T20:33:41.4215426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4215617Z context = 2025-05-07T20:33:41.4215621Z 2025-05-07T20:33:41.4215791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4216055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4216171Z module_map=module_map) 2025-05-07T20:33:41.4216403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4216506Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4216594Z E ^ 2025-05-07T20:33:41.4216951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4216956Z 2025-05-07T20:33:41.4217368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4217373Z 2025-05-07T20:33:41.4217489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4217716Z self=, 2025-05-07T20:33:41.4217803Z T=1, 2025-05-07T20:33:41.4217881Z D=5120, 2025-05-07T20:33:41.4217966Z scale_ub=None, 2025-05-07T20:33:41.4218059Z contiguous=True, 2025-05-07T20:33:41.4218148Z compiled=True, 2025-05-07T20:33:41.4218224Z ) 2025-05-07T20:33:41.4218456Z self = 2025-05-07T20:33:41.4218618Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.4218625Z 2025-05-07T20:33:41.4218705Z @given( 2025-05-07T20:33:41.4218829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4218928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4219049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4219168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4219283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4219365Z ) 2025-05-07T20:33:41.4219610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4219706Z def test_silu_mul_quant( 2025-05-07T20:33:41.4219793Z self, 2025-05-07T20:33:41.4219872Z T: int, 2025-05-07T20:33:41.4219954Z D: int, 2025-05-07T20:33:41.4220106Z scale_ub: Optional[float], 2025-05-07T20:33:41.4220202Z contiguous: bool, 2025-05-07T20:33:41.4220292Z compiled: bool, 2025-05-07T20:33:41.4220379Z ) -> None: 2025-05-07T20:33:41.4220479Z torch.manual_seed(2025) 2025-05-07T20:33:41.4220558Z 2025-05-07T20:33:41.4220729Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4220805Z 2025-05-07T20:33:41.4220902Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4221028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4221120Z x = x_sign * x_clamp 2025-05-07T20:33:41.4221209Z x0 = x[:, :D] 2025-05-07T20:33:41.4221291Z x1 = x[:, D:] 2025-05-07T20:33:41.4221366Z 2025-05-07T20:33:41.4221456Z if contiguous: 2025-05-07T20:33:41.4221549Z x0 = x0.contiguous() 2025-05-07T20:33:41.4221640Z x1 = x1.contiguous() 2025-05-07T20:33:41.4221720Z 2025-05-07T20:33:41.4221858Z if scale_ub is not None: 2025-05-07T20:33:41.4221976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4222112Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4222230Z ) 2025-05-07T20:33:41.4222313Z else: 2025-05-07T20:33:41.4222410Z scale_ub_tensor = None 2025-05-07T20:33:41.4222484Z 2025-05-07T20:33:41.4222618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4222714Z op = silu_mul_quant 2025-05-07T20:33:41.4222803Z if compiled: 2025-05-07T20:33:41.4222911Z op = torch.compile(op) 2025-05-07T20:33:41.4223018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4223095Z 2025-05-07T20:33:41.4223192Z y_fp8, y_scale = fn() 2025-05-07T20:33:41.4223314Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:41.4223396Z 2025-05-07T20:33:41.4223537Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4223646Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:41.4223795Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:41.4223919Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:41.4224064Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.4224145Z 2025-05-07T20:33:41.4224247Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:41.4224251Z 2025-05-07T20:33:41.4224350Z moe/activation_test.py:126: 2025-05-07T20:33:41.4224485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4224596Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:41.4224735Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.4225293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:41.4225399Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.4225770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4225994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4226365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:41.4226632Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.4227009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:41.4227181Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.4227523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:41.4227601Z fn() 2025-05-07T20:33:41.4228011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:41.4228139Z self.fn.run( 2025-05-07T20:33:41.4228485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4228583Z kernel = self.compile( 2025-05-07T20:33:41.4228962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4229143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4229275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4229279Z 2025-05-07T20:33:41.4229485Z self = 2025-05-07T20:33:41.4230355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4230866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2cb0cb2a20>} 2025-05-07T20:33:41.4231655Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4231849Z context = 2025-05-07T20:33:41.4231854Z 2025-05-07T20:33:41.4232026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4232290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4232398Z module_map=module_map) 2025-05-07T20:33:41.4232564Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4232672Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:41.4232795Z E ^ 2025-05-07T20:33:41.4233158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4233578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.4233693Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4249992Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4266475Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4282774Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:41.4289593Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:41.4289700Z moe/activation_test.py:126:
2025-05-07T20:33:41.4289949Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:41.4290123Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:41.4290686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:41.4290792Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:41.4297885Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4297991Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:41.4298071Z E   ^
2025-05-07T20:33:41.4298428Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4298843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
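Every ref_fn() example above fails in the same spot: triton_quantize_fp8_row launches the _kernel_quantize_fp8_row Triton kernel, whose output dtype maps to Triton's fp8e4nv (e4m3), and that format cannot be compiled on this GPU architecture. For reasoning about what never gets to run, a rough eager-mode sketch of the row-wise quantization follows, assuming the scale convention implied by the test's own dequant step (y_fp8.to(torch.float32) * y_scale[:, None]); the function name and scale_ub handling are illustrative, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch

def rowwise_fp8_quant_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs (optionally capped by scale_ub) sets the dequant
    # scale so that y ~= y_fp8.to(torch.float32) * scale[:, None].
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    row_max = y.abs().amax(dim=1).to(torch.float32).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

A sketch like this runs even on this machine, since the eager float8_e4m3fn cast does not go through Triton; only the fused kernels fail.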
2025-05-07T20:33:41.4298955Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:41.4311833Z > y_fp8, y_scale = fn()
2025-05-07T20:33:41.4311945Z moe/activation_test.py:117:
2025-05-07T20:33:41.4312228Z moe/activation_test.py:115: in fn
2025-05-07T20:33:41.4312328Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:41.4312709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:41.4312804Z     return fn(*args, **kwargs)
2025-05-07T20:33:41.4313299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:41.4313398Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:41.4318392Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4318495Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:41.4318576Z E   ^
2025-05-07T20:33:41.4318933Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4319431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
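The compiled variants add only the torch/_dynamo/eval_frame.py frame before re-entering the eager op, so every parameter combination hits the same compile-time error and no numerics are exercised. One way to keep the suite green on such hardware is to gate the test class on compute capability; a minimal sketch, assuming the sm_89 threshold at which Triton's fp8e4nv (e4m3) kernels are known to compile (verify against the Triton release in use; the class name is illustrative):

import unittest

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) kernels need sm_89 or newer (Ada/Hopper);
    # on older parts only fp8e4b15 and fp8e5 compile, per the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantTest(unittest.TestCase):  # illustrative name
    ...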
2025-05-07T20:33:41.4319541Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:41.4326184Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:41.4326291Z moe/activation_test.py:126:
2025-05-07T20:33:41.4326535Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:41.4326669Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:41.4327230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:41.4327342Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:41.4334477Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4334585Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:41.4334664Z E   ^
2025-05-07T20:33:41.4335021Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4335485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:41.4335594Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:41.4341162Z > y_fp8, y_scale = fn()
2025-05-07T20:33:41.4341262Z moe/activation_test.py:117:
2025-05-07T20:33:41.4341499Z moe/activation_test.py:115: in fn
2025-05-07T20:33:41.4341605Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:41.4342100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:41.4342200Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:41.4347191Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:41.4347356Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:41.4347437Z E   ^
2025-05-07T20:33:41.4347798Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:41.4348213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
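To confirm the mismatch on a given runner without the test suite, a quick probe of the device's compute capability suffices (the values in the comment are examples, not taken from this log):

import torch

# Prints the device name and (major, minor) compute capability of device 0.
# Triton's fp8e4nv path needs (8, 9) or newer, e.g. L4/L40S/H100, while
# Ampere parts report (8, 0) or (8, 6).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))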
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4347803Z 2025-05-07T20:33:41.4348213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4348218Z 2025-05-07T20:33:41.4348328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4348551Z self=, 2025-05-07T20:33:41.4348630Z T=128, 2025-05-07T20:33:41.4348713Z D=5120, 2025-05-07T20:33:41.4348798Z scale_ub=None, 2025-05-07T20:33:41.4348885Z contiguous=False, 2025-05-07T20:33:41.4348972Z compiled=True, 2025-05-07T20:33:41.4349047Z ) 2025-05-07T20:33:41.4349268Z self = 2025-05-07T20:33:41.4349452Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4349456Z 2025-05-07T20:33:41.4349540Z @given( 2025-05-07T20:33:41.4349663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4349765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4349879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4349998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4350111Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4350186Z ) 2025-05-07T20:33:41.4350431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4350528Z def test_silu_mul_quant( 2025-05-07T20:33:41.4350608Z self, 2025-05-07T20:33:41.4350688Z T: int, 2025-05-07T20:33:41.4350767Z D: int, 2025-05-07T20:33:41.4350875Z scale_ub: Optional[float], 2025-05-07T20:33:41.4350965Z contiguous: bool, 2025-05-07T20:33:41.4351103Z compiled: bool, 2025-05-07T20:33:41.4351186Z ) -> None: 2025-05-07T20:33:41.4351282Z torch.manual_seed(2025) 2025-05-07T20:33:41.4351360Z 2025-05-07T20:33:41.4351533Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4351613Z 2025-05-07T20:33:41.4351705Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4351834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4351923Z x = x_sign * x_clamp 2025-05-07T20:33:41.4352007Z x0 = x[:, :D] 2025-05-07T20:33:41.4352092Z x1 = x[:, D:] 2025-05-07T20:33:41.4352168Z 2025-05-07T20:33:41.4352258Z if contiguous: 2025-05-07T20:33:41.4352351Z x0 = x0.contiguous() 2025-05-07T20:33:41.4352442Z x1 = x1.contiguous() 2025-05-07T20:33:41.4352518Z 2025-05-07T20:33:41.4352611Z if scale_ub is not None: 2025-05-07T20:33:41.4352766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4352909Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4352987Z ) 2025-05-07T20:33:41.4353107Z else: 2025-05-07T20:33:41.4353206Z scale_ub_tensor = None 2025-05-07T20:33:41.4353280Z 2025-05-07T20:33:41.4353408Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4353503Z op = silu_mul_quant 2025-05-07T20:33:41.4353590Z if compiled: 2025-05-07T20:33:41.4353693Z op = torch.compile(op) 2025-05-07T20:33:41.4353802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4353878Z 2025-05-07T20:33:41.4353975Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4353980Z 2025-05-07T20:33:41.4354078Z moe/activation_test.py:117: 2025-05-07T20:33:41.4354207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4354315Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4354423Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4354832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4354937Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4355428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4355529Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4355939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4356160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4356503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4356597Z kernel = self.compile( 2025-05-07T20:33:41.4356981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4357161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4357288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4357295Z 2025-05-07T20:33:41.4357507Z self = 2025-05-07T20:33:41.4358283Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4358790Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbe5f100>} 2025-05-07T20:33:41.4359540Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4359778Z context = 2025-05-07T20:33:41.4359785Z 2025-05-07T20:33:41.4359954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4360215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4360332Z module_map=module_map) 2025-05-07T20:33:41.4360493Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4360592Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4360677Z E ^ 2025-05-07T20:33:41.4361032Z E ValueError("type fp8e4nv not supported in this architecture. 
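Every replayed example fails the same way: the Triton compiler rejects the fp8e4nv (e4m3) type at kernel-compile time because the job ran on a g5 runner (A10G, compute capability 8.6), while fp8e4nv requires SM 8.9 or newer. A minimal guard sketch, not part of the test file: the helper name, the skip wiring, and the (8, 9) threshold are assumptions inferred from the error above, not FBGEMM's own gating.

```python
# Sketch: skip fp8 tests on GPUs that predate fp8e4nv support.
# Assumption: fp8e4nv (e4m3) needs SM 8.9+ (Ada/Hopper); g5 runners
# carry A10G GPUs at SM 8.6, which matches the ValueError in this log.
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Hypothetical decorator for tests like test_silu_mul_quant above.
requires_fp8 = unittest.skipIf(
    not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+"
)
```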
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

        ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
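This example is the one variant in the replay: `fn()` itself completed, and the failure surfaced in the reference path instead, where `triton_quantize_fp8_row` launches the autotuned `_kernel_quantize_fp8_row` kernel and hits the same fp8e4nv rejection. For intuition only, a pure-PyTorch sketch of row-wise fp8 quantization of the kind this reference path performs; the function name, the epsilon clamp, and the exact scale handling are assumptions, not FBGEMM's implementation.

```python
# Sketch of per-row fp8 quantization (assumption: mirrors the shape of
# triton_quantize_fp8_row: one scale per row, optional scale upper bound).
import torch


def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)  # avoid divide-by-zero
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the row scale
    y_scale = row_max / fp8_max                     # one scale per row
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```

Dequantization is then what the test itself does: `y ≈ y_fp8.to(torch.float32) * y_scale[:, None]`.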
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4445493Z 2025-05-07T20:33:41.4445902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4445911Z 2025-05-07T20:33:41.4446058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4446288Z self=, 2025-05-07T20:33:41.4446367Z T=1, 2025-05-07T20:33:41.4446484Z D=5120, 2025-05-07T20:33:41.4446566Z scale_ub=1200.0, 2025-05-07T20:33:41.4446654Z contiguous=False, 2025-05-07T20:33:41.4446736Z compiled=True, 2025-05-07T20:33:41.4446810Z ) 2025-05-07T20:33:41.4447031Z self = 2025-05-07T20:33:41.4447196Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4447201Z 2025-05-07T20:33:41.4447280Z @given( 2025-05-07T20:33:41.4447399Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4447499Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4447615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4447730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4447847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4447970Z ) 2025-05-07T20:33:41.4448215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4448313Z def test_silu_mul_quant( 2025-05-07T20:33:41.4448391Z self, 2025-05-07T20:33:41.4448470Z T: int, 2025-05-07T20:33:41.4448546Z D: int, 2025-05-07T20:33:41.4448648Z scale_ub: Optional[float], 2025-05-07T20:33:41.4448737Z contiguous: bool, 2025-05-07T20:33:41.4448824Z compiled: bool, 2025-05-07T20:33:41.4448902Z ) -> None: 2025-05-07T20:33:41.4448997Z torch.manual_seed(2025) 2025-05-07T20:33:41.4449072Z 2025-05-07T20:33:41.4449240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4449315Z 2025-05-07T20:33:41.4449410Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4449537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4449629Z x = x_sign * x_clamp 2025-05-07T20:33:41.4449715Z x0 = x[:, :D] 2025-05-07T20:33:41.4449799Z x1 = x[:, D:] 2025-05-07T20:33:41.4449872Z 2025-05-07T20:33:41.4449959Z if contiguous: 2025-05-07T20:33:41.4450053Z x0 = x0.contiguous() 2025-05-07T20:33:41.4450145Z x1 = x1.contiguous() 2025-05-07T20:33:41.4450220Z 2025-05-07T20:33:41.4450309Z if scale_ub is not None: 2025-05-07T20:33:41.4450420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4450557Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4450632Z ) 2025-05-07T20:33:41.4450711Z else: 2025-05-07T20:33:41.4450804Z scale_ub_tensor = None 2025-05-07T20:33:41.4450877Z 2025-05-07T20:33:41.4451009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4451098Z op = silu_mul_quant 2025-05-07T20:33:41.4451184Z if compiled: 2025-05-07T20:33:41.4451290Z op = torch.compile(op) 2025-05-07T20:33:41.4451444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4451523Z 2025-05-07T20:33:41.4451615Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4451622Z 2025-05-07T20:33:41.4451718Z moe/activation_test.py:117: 2025-05-07T20:33:41.4451852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4451953Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4452053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4452429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4452521Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4453015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4453112Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4453509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4453739Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4454119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4454213Z kernel = self.compile( 2025-05-07T20:33:41.4454592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4454766Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4454904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4454908Z 2025-05-07T20:33:41.4455112Z self = 2025-05-07T20:33:41.4455930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4456443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd5ee0>} 2025-05-07T20:33:41.4457188Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4457384Z context = 2025-05-07T20:33:41.4457388Z 2025-05-07T20:33:41.4457556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4457816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4457928Z module_map=module_map) 2025-05-07T20:33:41.4458092Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4458202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4458282Z E ^ 2025-05-07T20:33:41.4458636Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4458643Z 2025-05-07T20:33:41.4459058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4459062Z 2025-05-07T20:33:41.4459166Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4459390Z self=, 2025-05-07T20:33:41.4459472Z T=1, 2025-05-07T20:33:41.4459553Z D=5120, 2025-05-07T20:33:41.4459643Z scale_ub=1200.0, 2025-05-07T20:33:41.4459730Z contiguous=False, 2025-05-07T20:33:41.4459815Z compiled=False, 2025-05-07T20:33:41.4459893Z ) 2025-05-07T20:33:41.4460116Z self = 2025-05-07T20:33:41.4460334Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4460338Z 2025-05-07T20:33:41.4460421Z @given( 2025-05-07T20:33:41.4460542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4460646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4460761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4460878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4460993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4461068Z ) 2025-05-07T20:33:41.4461309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4461406Z def test_silu_mul_quant( 2025-05-07T20:33:41.4461483Z self, 2025-05-07T20:33:41.4461563Z T: int, 2025-05-07T20:33:41.4461645Z D: int, 2025-05-07T20:33:41.4461745Z scale_ub: Optional[float], 2025-05-07T20:33:41.4461877Z contiguous: bool, 2025-05-07T20:33:41.4461972Z compiled: bool, 2025-05-07T20:33:41.4462050Z ) -> None: 2025-05-07T20:33:41.4462148Z torch.manual_seed(2025) 2025-05-07T20:33:41.4462264Z 2025-05-07T20:33:41.4462431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4462510Z 2025-05-07T20:33:41.4462601Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4462725Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4462821Z x = x_sign * x_clamp 2025-05-07T20:33:41.4462900Z x0 = x[:, :D] 2025-05-07T20:33:41.4462980Z x1 = x[:, D:] 2025-05-07T20:33:41.4463056Z 2025-05-07T20:33:41.4463138Z if contiguous: 2025-05-07T20:33:41.4463230Z x0 = x0.contiguous() 2025-05-07T20:33:41.4463321Z x1 = x1.contiguous() 2025-05-07T20:33:41.4463393Z 2025-05-07T20:33:41.4463485Z if scale_ub is not None: 2025-05-07T20:33:41.4463592Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4463775Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4463857Z ) 2025-05-07T20:33:41.4463935Z else: 2025-05-07T20:33:41.4464033Z scale_ub_tensor = None 2025-05-07T20:33:41.4464110Z 2025-05-07T20:33:41.4464242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4464334Z op = silu_mul_quant 2025-05-07T20:33:41.4464421Z if compiled: 2025-05-07T20:33:41.4464518Z op = torch.compile(op) 2025-05-07T20:33:41.4464623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4464705Z 2025-05-07T20:33:41.4464796Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4464801Z 2025-05-07T20:33:41.4464898Z moe/activation_test.py:117: 2025-05-07T20:33:41.4465027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4465127Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4465233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4466002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4466111Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4466475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4466696Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4467038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4467131Z kernel = self.compile( 2025-05-07T20:33:41.4467509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4467686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4467816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4467911Z 2025-05-07T20:33:41.4468123Z self = 2025-05-07T20:33:41.4468900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4469403Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfadd6b60>} 2025-05-07T20:33:41.4470151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4470340Z context = 2025-05-07T20:33:41.4470406Z 2025-05-07T20:33:41.4470577Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4470840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4471006Z module_map=module_map) 2025-05-07T20:33:41.4471171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4471270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4471346Z E ^ 2025-05-07T20:33:41.4471702Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4471706Z 2025-05-07T20:33:41.4472116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4472120Z 2025-05-07T20:33:41.4472227Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4472454Z self=, 2025-05-07T20:33:41.4472536Z T=16384, 2025-05-07T20:33:41.4472704Z D=5120, 2025-05-07T20:33:41.4472788Z scale_ub=1200.0, 2025-05-07T20:33:41.4472876Z contiguous=False, 2025-05-07T20:33:41.4472961Z compiled=True, 2025-05-07T20:33:41.4473034Z ) 2025-05-07T20:33:41.4473256Z self = 2025-05-07T20:33:41.4473435Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4473439Z 2025-05-07T20:33:41.4473515Z @given( 2025-05-07T20:33:41.4473636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4473737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4473850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4473969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4474080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4474156Z ) 2025-05-07T20:33:41.4474401Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4474500Z def test_silu_mul_quant( 2025-05-07T20:33:41.4474578Z self, 2025-05-07T20:33:41.4474658Z T: int, 2025-05-07T20:33:41.4474734Z D: int, 2025-05-07T20:33:41.4474837Z scale_ub: Optional[float], 2025-05-07T20:33:41.4474924Z contiguous: bool, 2025-05-07T20:33:41.4475010Z compiled: bool, 2025-05-07T20:33:41.4475093Z ) -> None: 2025-05-07T20:33:41.4475188Z torch.manual_seed(2025) 2025-05-07T20:33:41.4475262Z 2025-05-07T20:33:41.4475432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4475506Z 2025-05-07T20:33:41.4475596Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4475824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4475913Z x = x_sign * x_clamp 2025-05-07T20:33:41.4475998Z x0 = x[:, :D] 2025-05-07T20:33:41.4476078Z x1 = x[:, D:] 2025-05-07T20:33:41.4476154Z 2025-05-07T20:33:41.4476294Z if contiguous: 2025-05-07T20:33:41.4476387Z x0 = x0.contiguous() 2025-05-07T20:33:41.4476477Z x1 = x1.contiguous() 2025-05-07T20:33:41.4476555Z 2025-05-07T20:33:41.4476646Z if scale_ub is not None: 2025-05-07T20:33:41.4476752Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4476888Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4476964Z ) 2025-05-07T20:33:41.4477040Z else: 2025-05-07T20:33:41.4477135Z scale_ub_tensor = None 2025-05-07T20:33:41.4477207Z 2025-05-07T20:33:41.4477338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4477428Z op = silu_mul_quant 2025-05-07T20:33:41.4477512Z if compiled: 2025-05-07T20:33:41.4477614Z op = torch.compile(op) 2025-05-07T20:33:41.4477720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4477836Z 2025-05-07T20:33:41.4477933Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4477942Z 2025-05-07T20:33:41.4478038Z moe/activation_test.py:117: 2025-05-07T20:33:41.4478167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4478314Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4478415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4478785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4478878Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4479368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... remainder of this traceback is identical to the full record below ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f2bfb4c8f40>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
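Annotation: every one of these examples fails for the same environmental reason, not because of the values Hypothesis picked. Triton's fp8e4nv is the FP8 E4M3 encoding that torch.float8_e4m3fn maps to, and Triton only lowers it on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); on an older part such as the A10G (SM 8.6) that this g5 runner is assumed to use, the compiler offers only ('fp8e4b15', 'fp8e5') and raises exactly this ValueError. A minimal sketch of a test-side guard that skips instead of failing, assuming pytest; require_fp8e4nv is a hypothetical helper, not anything present in activation_test.py:

import pytest
import torch

def require_fp8e4nv() -> None:
    # Hypothetical guard: fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+.
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        pytest.skip(f"fp8e4nv requires compute capability >= 8.9, got {major}.{minor}")

Calling such a helper at the top of test_silu_mul_quant would collapse the repeated CompilationErrors below into a single skipped test on this runner.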
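Annotation: a minimal standalone reproducer for one failing example, useful for bisecting outside Hypothesis. This is a sketch, assuming the module path shown in the traceback (fbgemm_gpu.experimental.gen_ai.moe.activation) and the call signature used by the test; the parameter values are copied from the failing example above. On an SM 8.9+ GPU it should return the FP8 output and its scale rather than raise:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 2048, 7168
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]  # non-contiguous views, matching contiguous=False
scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

# compiled=True in the failing example; drop torch.compile for the eager path.
y_fp8, y_scale = torch.compile(silu_mul_quant)(x0, x1, scale_ub_tensor)
print(y_fp8.dtype, y_scale.dtype)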
[... Hypothesis prints the full test body and an identical traceback for each example that follows; those repeats are collapsed here to the parameter combinations actually tried. Every example fails in make_ir with the same CompilationError, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the only variation is that the torch/_dynamo/eval_frame.py frame appears in the compiled=True runs only. ...]

Trying example: test_silu_mul_quant(self=<...>, T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=<...>, T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test body and traceback identical to the full record above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4643517Z 2025-05-07T20:33:41.4643928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4643933Z 2025-05-07T20:33:41.4644034Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4644254Z self=, 2025-05-07T20:33:41.4644335Z T=4096, 2025-05-07T20:33:41.4644411Z D=7168, 2025-05-07T20:33:41.4644491Z scale_ub=None, 2025-05-07T20:33:41.4644578Z contiguous=False, 2025-05-07T20:33:41.4644664Z compiled=True, 2025-05-07T20:33:41.4644738Z ) 2025-05-07T20:33:41.4644998Z self = 2025-05-07T20:33:41.4645172Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4645179Z 2025-05-07T20:33:41.4645258Z @given( 2025-05-07T20:33:41.4645376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4645473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4645588Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4645706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4645819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4645897Z ) 2025-05-07T20:33:41.4646135Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4646227Z def test_silu_mul_quant( 2025-05-07T20:33:41.4646306Z self, 2025-05-07T20:33:41.4646381Z T: int, 2025-05-07T20:33:41.4646461Z D: int, 2025-05-07T20:33:41.4646561Z scale_ub: Optional[float], 2025-05-07T20:33:41.4646650Z contiguous: bool, 2025-05-07T20:33:41.4646737Z compiled: bool, 2025-05-07T20:33:41.4646818Z ) -> None: 2025-05-07T20:33:41.4646910Z torch.manual_seed(2025) 2025-05-07T20:33:41.4646985Z 2025-05-07T20:33:41.4647149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4647221Z 2025-05-07T20:33:41.4647314Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4647435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4647524Z x = x_sign * x_clamp 2025-05-07T20:33:41.4647605Z x0 = x[:, :D] 2025-05-07T20:33:41.4647684Z x1 = x[:, D:] 2025-05-07T20:33:41.4647759Z 2025-05-07T20:33:41.4647840Z if contiguous: 2025-05-07T20:33:41.4647930Z x0 = x0.contiguous() 2025-05-07T20:33:41.4648019Z x1 = x1.contiguous() 2025-05-07T20:33:41.4648091Z 2025-05-07T20:33:41.4648183Z if scale_ub is not None: 2025-05-07T20:33:41.4648336Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4648467Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4648543Z ) 2025-05-07T20:33:41.4648619Z else: 2025-05-07T20:33:41.4648711Z scale_ub_tensor = None 2025-05-07T20:33:41.4648783Z 2025-05-07T20:33:41.4648913Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4649001Z op = silu_mul_quant 2025-05-07T20:33:41.4649083Z if compiled: 2025-05-07T20:33:41.4649186Z op = torch.compile(op) 2025-05-07T20:33:41.4649289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4649363Z 2025-05-07T20:33:41.4649453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4649457Z 2025-05-07T20:33:41.4649551Z moe/activation_test.py:117: 2025-05-07T20:33:41.4649727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4649831Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4649932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4650300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4650433Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4650924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4651021Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4651376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4651597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4651931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4652026Z kernel = self.compile( 2025-05-07T20:33:41.4652452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4652628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4652759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4652763Z 2025-05-07T20:33:41.4652965Z self = 2025-05-07T20:33:41.4653737Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4654241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfac814e0>} 2025-05-07T20:33:41.4654986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4655183Z context = 2025-05-07T20:33:41.4655187Z 2025-05-07T20:33:41.4655349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4655612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4655718Z module_map=module_map) 2025-05-07T20:33:41.4655876Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4655980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4656056Z E ^ 2025-05-07T20:33:41.4656407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4656412Z 2025-05-07T20:33:41.4656831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4656877Z 2025-05-07T20:33:41.4656979Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4657206Z self=, 2025-05-07T20:33:41.4657282Z T=16384, 2025-05-07T20:33:41.4657359Z D=5120, 2025-05-07T20:33:41.4657443Z scale_ub=1200.0, 2025-05-07T20:33:41.4657528Z contiguous=False, 2025-05-07T20:33:41.4657611Z compiled=False, 2025-05-07T20:33:41.4657691Z ) 2025-05-07T20:33:41.4657905Z self = 2025-05-07T20:33:41.4658083Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4658092Z 2025-05-07T20:33:41.4658168Z @given( 2025-05-07T20:33:41.4658284Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4658387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4658566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4658687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4658803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4658918Z ) 2025-05-07T20:33:41.4659158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4659254Z def test_silu_mul_quant( 2025-05-07T20:33:41.4659329Z self, 2025-05-07T20:33:41.4659406Z T: int, 2025-05-07T20:33:41.4659485Z D: int, 2025-05-07T20:33:41.4659582Z scale_ub: Optional[float], 2025-05-07T20:33:41.4659672Z contiguous: bool, 2025-05-07T20:33:41.4659755Z compiled: bool, 2025-05-07T20:33:41.4659831Z ) -> None: 2025-05-07T20:33:41.4659925Z torch.manual_seed(2025) 2025-05-07T20:33:41.4659995Z 2025-05-07T20:33:41.4660160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4660239Z 2025-05-07T20:33:41.4660331Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4660497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4660590Z x = x_sign * x_clamp 2025-05-07T20:33:41.4660673Z x0 = x[:, :D] 2025-05-07T20:33:41.4660752Z x1 = x[:, D:] 2025-05-07T20:33:41.4660825Z 2025-05-07T20:33:41.4660906Z if contiguous: 2025-05-07T20:33:41.4660996Z x0 = x0.contiguous() 2025-05-07T20:33:41.4661086Z x1 = x1.contiguous() 2025-05-07T20:33:41.4661157Z 2025-05-07T20:33:41.4661248Z if scale_ub is not None: 2025-05-07T20:33:41.4661350Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4661482Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4661559Z ) 2025-05-07T20:33:41.4661633Z else: 2025-05-07T20:33:41.4661725Z scale_ub_tensor = None 2025-05-07T20:33:41.4661800Z 2025-05-07T20:33:41.4661929Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4662019Z op = silu_mul_quant 2025-05-07T20:33:41.4662108Z if compiled: 2025-05-07T20:33:41.4662206Z op = torch.compile(op) 2025-05-07T20:33:41.4662312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4662386Z 2025-05-07T20:33:41.4662476Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4662480Z 2025-05-07T20:33:41.4662582Z moe/activation_test.py:117: 2025-05-07T20:33:41.4662709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4662808Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4662909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4663403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:41.4663499Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4663858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4664127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4664464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4664558Z kernel = self.compile( 2025-05-07T20:33:41.4664938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4665111Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4665235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4665239Z 2025-05-07T20:33:41.4665644Z self = 2025-05-07T20:33:41.4666562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4667074Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfac823e0>} 2025-05-07T20:33:41.4667879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4668069Z context = 2025-05-07T20:33:41.4668074Z 2025-05-07T20:33:41.4668238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4668496Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4668602Z module_map=module_map) 2025-05-07T20:33:41.4668766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4668924Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4669004Z E ^ 2025-05-07T20:33:41.4669357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4669365Z 2025-05-07T20:33:41.4669772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4669776Z 2025-05-07T20:33:41.4669884Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4670103Z self=, 2025-05-07T20:33:41.4670187Z T=16384, 2025-05-07T20:33:41.4670263Z D=5120, 2025-05-07T20:33:41.4670344Z scale_ub=1200.0, 2025-05-07T20:33:41.4670430Z contiguous=True, 2025-05-07T20:33:41.4670516Z compiled=True, 2025-05-07T20:33:41.4670589Z ) 2025-05-07T20:33:41.4670810Z self = 2025-05-07T20:33:41.4670988Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4670992Z 2025-05-07T20:33:41.4671069Z @given( 2025-05-07T20:33:41.4671193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4671290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4671403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4671522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4671634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4671710Z ) 2025-05-07T20:33:41.4671950Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4672043Z def test_silu_mul_quant( 2025-05-07T20:33:41.4672121Z self, 2025-05-07T20:33:41.4672196Z T: int, 2025-05-07T20:33:41.4672271Z D: int, 2025-05-07T20:33:41.4672370Z scale_ub: Optional[float], 2025-05-07T20:33:41.4672460Z contiguous: bool, 2025-05-07T20:33:41.4672608Z compiled: bool, 2025-05-07T20:33:41.4672696Z ) -> None: 2025-05-07T20:33:41.4672790Z torch.manual_seed(2025) 2025-05-07T20:33:41.4672864Z 2025-05-07T20:33:41.4673035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4673108Z 2025-05-07T20:33:41.4673203Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4673325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4673412Z x = x_sign * x_clamp 2025-05-07T20:33:41.4673494Z x0 = x[:, :D] 2025-05-07T20:33:41.4673573Z x1 = x[:, D:] 2025-05-07T20:33:41.4673643Z 2025-05-07T20:33:41.4673727Z if contiguous: 2025-05-07T20:33:41.4673817Z x0 = x0.contiguous() 2025-05-07T20:33:41.4673903Z x1 = x1.contiguous() 2025-05-07T20:33:41.4673978Z 2025-05-07T20:33:41.4674068Z if scale_ub is not None: 2025-05-07T20:33:41.4674216Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4674357Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4674434Z ) 2025-05-07T20:33:41.4674512Z else: 2025-05-07T20:33:41.4674644Z scale_ub_tensor = None 2025-05-07T20:33:41.4674715Z 2025-05-07T20:33:41.4674845Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4674933Z op = silu_mul_quant 2025-05-07T20:33:41.4675018Z if compiled: 2025-05-07T20:33:41.4675119Z op = torch.compile(op) 2025-05-07T20:33:41.4675223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4675295Z 2025-05-07T20:33:41.4675390Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4675394Z 2025-05-07T20:33:41.4675490Z moe/activation_test.py:117: 2025-05-07T20:33:41.4675616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4675789Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4678997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4679456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4679558Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4680056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4680158Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4680515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4680737Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4681079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4681172Z kernel = self.compile( 2025-05-07T20:33:41.4681559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4681739Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4681867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4681874Z 2025-05-07T20:33:41.4682084Z self = 2025-05-07T20:33:41.4682859Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4683366Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfac83a60>} 2025-05-07T20:33:41.4684111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4684362Z context = 2025-05-07T20:33:41.4684369Z 2025-05-07T20:33:41.4684539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4684802Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4684915Z module_map=module_map) 2025-05-07T20:33:41.4685076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4685177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4685261Z E ^ 2025-05-07T20:33:41.4685614Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4685618Z 2025-05-07T20:33:41.4686032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4686078Z 2025-05-07T20:33:41.4686188Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4686412Z self=, 2025-05-07T20:33:41.4686604Z T=16384, 2025-05-07T20:33:41.4686681Z D=5120, 2025-05-07T20:33:41.4686764Z scale_ub=None, 2025-05-07T20:33:41.4686854Z contiguous=False, 2025-05-07T20:33:41.4686938Z compiled=True, 2025-05-07T20:33:41.4687012Z ) 2025-05-07T20:33:41.4687232Z self = 2025-05-07T20:33:41.4687408Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4687413Z 2025-05-07T20:33:41.4687493Z @given( 2025-05-07T20:33:41.4687612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4687714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4687832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4687952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4688110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4688188Z ) 2025-05-07T20:33:41.4688432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4688532Z def test_silu_mul_quant( 2025-05-07T20:33:41.4688609Z self, 2025-05-07T20:33:41.4688687Z T: int, 2025-05-07T20:33:41.4688769Z D: int, 2025-05-07T20:33:41.4688868Z scale_ub: Optional[float], 2025-05-07T20:33:41.4688958Z contiguous: bool, 2025-05-07T20:33:41.4689049Z compiled: bool, 2025-05-07T20:33:41.4689130Z ) -> None: 2025-05-07T20:33:41.4689227Z torch.manual_seed(2025) 2025-05-07T20:33:41.4689305Z 2025-05-07T20:33:41.4689472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4689547Z 2025-05-07T20:33:41.4689645Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4689772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4689862Z x = x_sign * x_clamp 2025-05-07T20:33:41.4689954Z x0 = x[:, :D] 2025-05-07T20:33:41.4690035Z x1 = x[:, D:] 2025-05-07T20:33:41.4690113Z 2025-05-07T20:33:41.4690198Z if contiguous: 2025-05-07T20:33:41.4690289Z x0 = x0.contiguous() 2025-05-07T20:33:41.4690383Z x1 = x1.contiguous() 2025-05-07T20:33:41.4690456Z 2025-05-07T20:33:41.4690546Z if scale_ub is not None: 2025-05-07T20:33:41.4690655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4690788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4690864Z ) 2025-05-07T20:33:41.4690943Z else: 2025-05-07T20:33:41.4691039Z scale_ub_tensor = None 2025-05-07T20:33:41.4691115Z 2025-05-07T20:33:41.4691247Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4691339Z op = silu_mul_quant 2025-05-07T20:33:41.4691427Z if compiled: 2025-05-07T20:33:41.4691527Z op = torch.compile(op) 2025-05-07T20:33:41.4691705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4691782Z 2025-05-07T20:33:41.4691877Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4691882Z 2025-05-07T20:33:41.4691978Z moe/activation_test.py:117: 2025-05-07T20:33:41.4692110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4692211Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4692309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4692678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4692773Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4693264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4693361Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4693760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4693989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4694366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4694460Z kernel = self.compile( 2025-05-07T20:33:41.4694842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4695015Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4695149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4695153Z 2025-05-07T20:33:41.4695359Z self = 2025-05-07T20:33:41.4696176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4696689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbb88cc0>} 2025-05-07T20:33:41.4697435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4697629Z context = 2025-05-07T20:33:41.4697633Z 2025-05-07T20:33:41.4697798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4698061Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4698171Z module_map=module_map) 2025-05-07T20:33:41.4698336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4698439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4698518Z E ^ 2025-05-07T20:33:41.4698874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4698879Z 2025-05-07T20:33:41.4699292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4699296Z 2025-05-07T20:33:41.4699400Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4699641Z self=, 2025-05-07T20:33:41.4699731Z T=2048, 2025-05-07T20:33:41.4699822Z D=5120, 2025-05-07T20:33:41.4699922Z scale_ub=None, 2025-05-07T20:33:41.4700009Z contiguous=False, 2025-05-07T20:33:41.4700092Z compiled=True, 2025-05-07T20:33:41.4700170Z ) 2025-05-07T20:33:41.4700390Z self = 2025-05-07T20:33:41.4700610Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4700621Z 2025-05-07T20:33:41.4700699Z @given( 2025-05-07T20:33:41.4700816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4700922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4701035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4701152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4701269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4701345Z ) 2025-05-07T20:33:41.4701584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4701681Z def test_silu_mul_quant( 2025-05-07T20:33:41.4701756Z self, 2025-05-07T20:33:41.4701832Z T: int, 2025-05-07T20:33:41.4701912Z D: int, 2025-05-07T20:33:41.4702051Z scale_ub: Optional[float], 2025-05-07T20:33:41.4702149Z contiguous: bool, 2025-05-07T20:33:41.4702234Z compiled: bool, 2025-05-07T20:33:41.4702311Z ) -> None: 2025-05-07T20:33:41.4702460Z torch.manual_seed(2025) 2025-05-07T20:33:41.4702533Z 2025-05-07T20:33:41.4702699Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4702774Z 2025-05-07T20:33:41.4702863Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4702986Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4703078Z x = x_sign * x_clamp 2025-05-07T20:33:41.4703157Z x0 = x[:, :D] 2025-05-07T20:33:41.4703237Z x1 = x[:, D:] 2025-05-07T20:33:41.4703311Z 2025-05-07T20:33:41.4703395Z if contiguous: 2025-05-07T20:33:41.4703489Z x0 = x0.contiguous() 2025-05-07T20:33:41.4703577Z x1 = x1.contiguous() 2025-05-07T20:33:41.4703651Z 2025-05-07T20:33:41.4703747Z if scale_ub is not None: 2025-05-07T20:33:41.4703852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4704030Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4704108Z ) 2025-05-07T20:33:41.4704187Z else: 2025-05-07T20:33:41.4704280Z scale_ub_tensor = None 2025-05-07T20:33:41.4704354Z 2025-05-07T20:33:41.4704481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4704570Z op = silu_mul_quant 2025-05-07T20:33:41.4704657Z if compiled: 2025-05-07T20:33:41.4704756Z op = torch.compile(op) 2025-05-07T20:33:41.4704864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4704935Z 2025-05-07T20:33:41.4705025Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4705029Z 2025-05-07T20:33:41.4705127Z moe/activation_test.py:117: 2025-05-07T20:33:41.4705254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4705357Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4705462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4705824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4705919Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4706411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4706507Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4706864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4707083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4707418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4707514Z kernel = self.compile( 2025-05-07T20:33:41.4707894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4708113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4708244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4708248Z 2025-05-07T20:33:41.4708451Z self = 2025-05-07T20:33:41.4709226Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4709727Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbb89a80>} 2025-05-07T20:33:41.4710564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4710759Z context = 2025-05-07T20:33:41.4710802Z 2025-05-07T20:33:41.4710969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4711233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4711342Z module_map=module_map) 2025-05-07T20:33:41.4711503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4711603Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4711680Z E ^ 2025-05-07T20:33:41.4712034Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4712038Z 2025-05-07T20:33:41.4712453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4712497Z 2025-05-07T20:33:41.4712604Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4712825Z self=, 2025-05-07T20:33:41.4712906Z T=2048, 2025-05-07T20:33:41.4712988Z D=5120, 2025-05-07T20:33:41.4713072Z scale_ub=1200.0, 2025-05-07T20:33:41.4713158Z contiguous=False, 2025-05-07T20:33:41.4713245Z compiled=True, 2025-05-07T20:33:41.4713319Z ) 2025-05-07T20:33:41.4713536Z self = 2025-05-07T20:33:41.4713713Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4713718Z 2025-05-07T20:33:41.4713797Z @given( 2025-05-07T20:33:41.4713917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4714015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4714131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4714256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4714369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4714446Z ) 2025-05-07T20:33:41.4714689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4714781Z def test_silu_mul_quant( 2025-05-07T20:33:41.4714860Z self, 2025-05-07T20:33:41.4714941Z T: int, 2025-05-07T20:33:41.4715018Z D: int, 2025-05-07T20:33:41.4715117Z scale_ub: Optional[float], 2025-05-07T20:33:41.4715209Z contiguous: bool, 2025-05-07T20:33:41.4715295Z compiled: bool, 2025-05-07T20:33:41.4715377Z ) -> None: 2025-05-07T20:33:41.4715471Z torch.manual_seed(2025) 2025-05-07T20:33:41.4715545Z 2025-05-07T20:33:41.4715776Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4715852Z 2025-05-07T20:33:41.4715947Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4716074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4716214Z x = x_sign * x_clamp 2025-05-07T20:33:41.4716295Z x0 = x[:, :D] 2025-05-07T20:33:41.4716379Z x1 = x[:, D:] 2025-05-07T20:33:41.4716453Z 2025-05-07T20:33:41.4716539Z if contiguous: 2025-05-07T20:33:41.4716636Z x0 = x0.contiguous() 2025-05-07T20:33:41.4716723Z x1 = x1.contiguous() 2025-05-07T20:33:41.4716799Z 2025-05-07T20:33:41.4716890Z if scale_ub is not None: 2025-05-07T20:33:41.4716994Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4717129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4717205Z ) 2025-05-07T20:33:41.4717281Z else: 2025-05-07T20:33:41.4717379Z scale_ub_tensor = None 2025-05-07T20:33:41.4717455Z 2025-05-07T20:33:41.4717582Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4717721Z op = silu_mul_quant 2025-05-07T20:33:41.4717811Z if compiled: 2025-05-07T20:33:41.4717912Z op = torch.compile(op) 2025-05-07T20:33:41.4718019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4718134Z 2025-05-07T20:33:41.4718226Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4718233Z 2025-05-07T20:33:41.4718330Z moe/activation_test.py:117: 2025-05-07T20:33:41.4718458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4718561Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4718660Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4719022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4719118Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4719604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4719706Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4720124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4720349Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4720688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4720781Z kernel = self.compile( 2025-05-07T20:33:41.4721158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4721332Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4721459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4721464Z 2025-05-07T20:33:41.4721670Z self = 2025-05-07T20:33:41.4722448Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4722953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfbb8ac00>} 2025-05-07T20:33:41.4723694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4723883Z context = 2025-05-07T20:33:41.4723887Z 2025-05-07T20:33:41.4724050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4724313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4724420Z module_map=module_map) 2025-05-07T20:33:41.4724629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4724731Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4724812Z E ^ 2025-05-07T20:33:41.4725164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4725168Z 2025-05-07T20:33:41.4725575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4725579Z 2025-05-07T20:33:41.4725684Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4725906Z self=, 2025-05-07T20:33:41.4725989Z T=4096, 2025-05-07T20:33:41.4726066Z D=5120, 2025-05-07T20:33:41.4726149Z scale_ub=1200.0, 2025-05-07T20:33:41.4726238Z contiguous=True, 2025-05-07T20:33:41.4726360Z compiled=True, 2025-05-07T20:33:41.4726438Z ) 2025-05-07T20:33:41.4726661Z self = 2025-05-07T20:33:41.4726831Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4726875Z 2025-05-07T20:33:41.4726955Z @given( 2025-05-07T20:33:41.4727075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4727173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4727290Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4727406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4727520Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4727599Z ) 2025-05-07T20:33:41.4727840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4727935Z def test_silu_mul_quant( 2025-05-07T20:33:41.4728020Z self, 2025-05-07T20:33:41.4728099Z T: int, 2025-05-07T20:33:41.4728180Z D: int, 2025-05-07T20:33:41.4728325Z scale_ub: Optional[float], 2025-05-07T20:33:41.4728415Z contiguous: bool, 2025-05-07T20:33:41.4728500Z compiled: bool, 2025-05-07T20:33:41.4728591Z ) -> None: 2025-05-07T20:33:41.4728686Z torch.manual_seed(2025) 2025-05-07T20:33:41.4728766Z 2025-05-07T20:33:41.4728932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4729007Z 2025-05-07T20:33:41.4729103Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4729228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4729317Z x = x_sign * x_clamp 2025-05-07T20:33:41.4729406Z x0 = x[:, :D] 2025-05-07T20:33:41.4729486Z x1 = x[:, D:] 2025-05-07T20:33:41.4729559Z 2025-05-07T20:33:41.4729646Z if contiguous: 2025-05-07T20:33:41.4729737Z x0 = x0.contiguous() 2025-05-07T20:33:41.4729830Z x1 = x1.contiguous() 2025-05-07T20:33:41.4729909Z 2025-05-07T20:33:41.4729999Z if scale_ub is not None: 2025-05-07T20:33:41.4730110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4730247Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4730329Z ) 2025-05-07T20:33:41.4730408Z else: 2025-05-07T20:33:41.4730502Z scale_ub_tensor = None 2025-05-07T20:33:41.4730576Z 2025-05-07T20:33:41.4730708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4730798Z op = silu_mul_quant 2025-05-07T20:33:41.4730883Z if compiled: 2025-05-07T20:33:41.4730984Z op = torch.compile(op) 2025-05-07T20:33:41.4731089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4731161Z 2025-05-07T20:33:41.4731260Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4731265Z 2025-05-07T20:33:41.4731360Z moe/activation_test.py:117: 2025-05-07T20:33:41.4731496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4731646Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4731745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4732108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4732203Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4732691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4732789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4733146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4733370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4733705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4733845Z kernel = self.compile( 2025-05-07T20:33:41.4734230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4734402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4734569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4734577Z 2025-05-07T20:33:41.4734784Z self = 2025-05-07T20:33:41.4735555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4736060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa828220>} 2025-05-07T20:33:41.4736842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4737042Z context = 2025-05-07T20:33:41.4737047Z 2025-05-07T20:33:41.4737208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4737467Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4737578Z module_map=module_map) 2025-05-07T20:33:41.4737737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4737838Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4737916Z E ^ 2025-05-07T20:33:41.4738265Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4738270Z 2025-05-07T20:33:41.4738687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4738694Z 2025-05-07T20:33:41.4738795Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4739019Z self=, 2025-05-07T20:33:41.4739098Z T=128, 2025-05-07T20:33:41.4739175Z D=5120, 2025-05-07T20:33:41.4739259Z scale_ub=1200.0, 2025-05-07T20:33:41.4739345Z contiguous=False, 2025-05-07T20:33:41.4739427Z compiled=True, 2025-05-07T20:33:41.4739504Z ) 2025-05-07T20:33:41.4739720Z self = 2025-05-07T20:33:41.4739892Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:41.4739896Z 2025-05-07T20:33:41.4739976Z @given( 2025-05-07T20:33:41.4740096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4740196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4740316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4740481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4740596Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4740674Z ) 2025-05-07T20:33:41.4740915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4741013Z def test_silu_mul_quant( 2025-05-07T20:33:41.4741093Z self, 2025-05-07T20:33:41.4741170Z T: int, 2025-05-07T20:33:41.4741251Z D: int, 2025-05-07T20:33:41.4741350Z scale_ub: Optional[float], 2025-05-07T20:33:41.4741439Z contiguous: bool, 2025-05-07T20:33:41.4741530Z compiled: bool, 2025-05-07T20:33:41.4741608Z ) -> None: 2025-05-07T20:33:41.4741706Z torch.manual_seed(2025) 2025-05-07T20:33:41.4741780Z 2025-05-07T20:33:41.4741946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4742072Z 2025-05-07T20:33:41.4742166Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4742294Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4742386Z x = x_sign * x_clamp 2025-05-07T20:33:41.4742505Z x0 = x[:, :D] 2025-05-07T20:33:41.4742586Z x1 = x[:, D:] 2025-05-07T20:33:41.4742662Z 2025-05-07T20:33:41.4742745Z if contiguous: 2025-05-07T20:33:41.4742837Z x0 = x0.contiguous() 2025-05-07T20:33:41.4742931Z x1 = x1.contiguous() 2025-05-07T20:33:41.4743003Z 2025-05-07T20:33:41.4743094Z if scale_ub is not None: 2025-05-07T20:33:41.4743203Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4743336Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4743414Z ) 2025-05-07T20:33:41.4743495Z else: 2025-05-07T20:33:41.4743589Z scale_ub_tensor = None 2025-05-07T20:33:41.4743664Z 2025-05-07T20:33:41.4743795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4743929Z op = silu_mul_quant 2025-05-07T20:33:41.4744022Z if compiled: 2025-05-07T20:33:41.4744122Z op = torch.compile(op) 2025-05-07T20:33:41.4744228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4744305Z 2025-05-07T20:33:41.4744397Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4744401Z 2025-05-07T20:33:41.4744504Z moe/activation_test.py:117: 2025-05-07T20:33:41.4744631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4744732Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4744835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4745199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4745291Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4745783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4745885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4746243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4746465Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4746799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4746897Z kernel = self.compile( 2025-05-07T20:33:41.4747278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4747450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4747579Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4747583Z 2025-05-07T20:33:41.4747789Z self = 2025-05-07T20:33:41.4748614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4749118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa828f40>} 2025-05-07T20:33:41.4749862Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4750053Z context = 2025-05-07T20:33:41.4750057Z 2025-05-07T20:33:41.4750219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4750524Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4750638Z module_map=module_map) 2025-05-07T20:33:41.4750804Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4750967Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4751046Z E ^ 2025-05-07T20:33:41.4751400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4751405Z 2025-05-07T20:33:41.4751812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4751816Z 2025-05-07T20:33:41.4751919Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4752142Z self=, 2025-05-07T20:33:41.4752222Z T=16384, 2025-05-07T20:33:41.4752302Z D=7168, 2025-05-07T20:33:41.4752386Z scale_ub=1200.0, 2025-05-07T20:33:41.4752474Z contiguous=True, 2025-05-07T20:33:41.4752603Z compiled=True, 2025-05-07T20:33:41.4752677Z ) 2025-05-07T20:33:41.4752893Z self = 2025-05-07T20:33:41.4753074Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4753079Z 2025-05-07T20:33:41.4753157Z @given( 2025-05-07T20:33:41.4753275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4753379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4753495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4753615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4753727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4753806Z ) 2025-05-07T20:33:41.4754050Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4754145Z def test_silu_mul_quant( 2025-05-07T20:33:41.4754224Z self, 2025-05-07T20:33:41.4754305Z T: int, 2025-05-07T20:33:41.4754385Z D: int, 2025-05-07T20:33:41.4754483Z scale_ub: Optional[float], 2025-05-07T20:33:41.4754578Z contiguous: bool, 2025-05-07T20:33:41.4754663Z compiled: bool, 2025-05-07T20:33:41.4754742Z ) -> None: 2025-05-07T20:33:41.4754839Z torch.manual_seed(2025) 2025-05-07T20:33:41.4754912Z 2025-05-07T20:33:41.4755081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4755156Z 2025-05-07T20:33:41.4755249Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4755377Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4755467Z x = x_sign * x_clamp 2025-05-07T20:33:41.4755547Z x0 = x[:, :D] 2025-05-07T20:33:41.4755631Z x1 = x[:, D:] 2025-05-07T20:33:41.4755750Z 2025-05-07T20:33:41.4755837Z if contiguous: 2025-05-07T20:33:41.4755933Z x0 = x0.contiguous() 2025-05-07T20:33:41.4756025Z x1 = x1.contiguous() 2025-05-07T20:33:41.4756146Z 2025-05-07T20:33:41.4756242Z if scale_ub is not None: 2025-05-07T20:33:41.4756347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4756488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4756565Z ) 2025-05-07T20:33:41.4756642Z else: 2025-05-07T20:33:41.4756739Z scale_ub_tensor = None 2025-05-07T20:33:41.4756812Z 2025-05-07T20:33:41.4756940Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4757035Z op = silu_mul_quant 2025-05-07T20:33:41.4757120Z if compiled: 2025-05-07T20:33:41.4757218Z op = torch.compile(op) 2025-05-07T20:33:41.4757325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4757398Z 2025-05-07T20:33:41.4757490Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4757494Z 2025-05-07T20:33:41.4757597Z moe/activation_test.py:117: 2025-05-07T20:33:41.4757772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4757884Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4757982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4758384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.4758481Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.4758966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4759062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4759418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4759636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4759976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4760111Z kernel = self.compile( 2025-05-07T20:33:41.4760489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4760669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4760795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4760800Z 2025-05-07T20:33:41.4761006Z self = 2025-05-07T20:33:41.4761780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4762285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa82a160>} 2025-05-07T20:33:41.4763031Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4763222Z context = 2025-05-07T20:33:41.4763226Z 2025-05-07T20:33:41.4763390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4763650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4763759Z module_map=module_map) 2025-05-07T20:33:41.4763923Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4764021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4764103Z E ^ 2025-05-07T20:33:41.4764457Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4764502Z 2025-05-07T20:33:41.4764912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4764918Z 2025-05-07T20:33:41.4765023Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4765244Z self=, 2025-05-07T20:33:41.4765323Z T=16384, 2025-05-07T20:33:41.4765626Z D=5120, 2025-05-07T20:33:41.4765752Z scale_ub=1200.0, 2025-05-07T20:33:41.4765860Z contiguous=True, 2025-05-07T20:33:41.4765946Z compiled=False, 2025-05-07T20:33:41.4766017Z ) 2025-05-07T20:33:41.4766236Z self = 2025-05-07T20:33:41.4766411Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4766415Z 2025-05-07T20:33:41.4766490Z @given( 2025-05-07T20:33:41.4766719Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4766821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4766937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4767055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4767231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4767310Z ) 2025-05-07T20:33:41.4767552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4767644Z def test_silu_mul_quant( 2025-05-07T20:33:41.4767722Z self, 2025-05-07T20:33:41.4767799Z T: int, 2025-05-07T20:33:41.4767873Z D: int, 2025-05-07T20:33:41.4767972Z scale_ub: Optional[float], 2025-05-07T20:33:41.4768059Z contiguous: bool, 2025-05-07T20:33:41.4768144Z compiled: bool, 2025-05-07T20:33:41.4768225Z ) -> None: 2025-05-07T20:33:41.4768318Z torch.manual_seed(2025) 2025-05-07T20:33:41.4768390Z 2025-05-07T20:33:41.4768561Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4768635Z 2025-05-07T20:33:41.4768789Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4768915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4769005Z x = x_sign * x_clamp 2025-05-07T20:33:41.4769087Z x0 = x[:, :D] 2025-05-07T20:33:41.4769167Z x1 = x[:, D:] 2025-05-07T20:33:41.4769237Z 2025-05-07T20:33:41.4769322Z if contiguous: 2025-05-07T20:33:41.4769411Z x0 = x0.contiguous() 2025-05-07T20:33:41.4769498Z x1 = x1.contiguous() 2025-05-07T20:33:41.4769571Z 2025-05-07T20:33:41.4769660Z if scale_ub is not None: 2025-05-07T20:33:41.4769764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4769898Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4769972Z ) 2025-05-07T20:33:41.4770046Z else: 2025-05-07T20:33:41.4770141Z scale_ub_tensor = None 2025-05-07T20:33:41.4770217Z 2025-05-07T20:33:41.4770351Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4770439Z op = silu_mul_quant 2025-05-07T20:33:41.4770523Z if compiled: 2025-05-07T20:33:41.4770625Z op = torch.compile(op) 2025-05-07T20:33:41.4770729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4770800Z 2025-05-07T20:33:41.4770891Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4770895Z 2025-05-07T20:33:41.4770989Z moe/activation_test.py:117: 2025-05-07T20:33:41.4771117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4771218Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4771315Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4771811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
Hypothesis keeps drawing examples, and each retry re-prints the identical test source and the identical traceback; only the drawn parameters and the resulting error change (with compiled=True the sole extra frame is torch/_dynamo/eval_frame.py:678: in _fn). The retries condense to:

Trying example: test_silu_mul_quant(T=1,    D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv not supported)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv not supported)

The next example is the first to fail differently, before ever reaching the kernel:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
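Two things stand out in these OutOfMemoryError reports: the process holds roughly 21.9-22.0 GiB of a 22.07 GiB device throughout, while the failed requests shrink to as little as 40-56 MiB, which suggests tensors from earlier failed examples are never released and each retry starts with less headroom. Beyond the allocator hint printed in the message itself (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True), a common mitigation is to release cached memory between examples. A sketch using only standard PyTorch calls (the helper name is hypothetical):

    import gc
    import os

    import torch

    # From the error message: let the caching allocator grow segments instead of
    # fragmenting fixed-size ones. Must be set before the first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


    def release_cuda_memory() -> None:
        """Call between Hypothesis examples (e.g. in setUp/tearDown) so one
        failing example does not starve the next."""
        gc.collect()              # drop unreachable Python references first
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
        torch.cuda.synchronize()  # ensure pending frees have completed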
Subsequent examples alternate between the same two failures:

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError (112.00 MiB at moe/activation_test.py:95)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError (448.00 MiB at moe/activation_test.py:92, the torch.randn allocation)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError (56.00 MiB at moe/activation_test.py:95)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError (56.00 MiB at moe/activation_test.py:94, x_sign = torch.sign(x))
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv not supported)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4867917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4868138Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4868479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4868582Z kernel = self.compile( 2025-05-07T20:33:41.4869023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4869204Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4869336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4869340Z 2025-05-07T20:33:41.4869545Z self = 2025-05-07T20:33:41.4870327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4870829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa7f65c0>} 2025-05-07T20:33:41.4871581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4871776Z context = 2025-05-07T20:33:41.4871781Z 2025-05-07T20:33:41.4871943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4872209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4872318Z module_map=module_map) 2025-05-07T20:33:41.4872482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4872581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4872658Z E ^ 2025-05-07T20:33:41.4873016Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4873020Z 2025-05-07T20:33:41.4873435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4873498Z 2025-05-07T20:33:41.4873607Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4873832Z self=, 2025-05-07T20:33:41.4873911Z T=128, 2025-05-07T20:33:41.4873991Z D=5120, 2025-05-07T20:33:41.4874074Z scale_ub=None, 2025-05-07T20:33:41.4874160Z contiguous=True, 2025-05-07T20:33:41.4874249Z compiled=False, 2025-05-07T20:33:41.4874323Z ) 2025-05-07T20:33:41.4874539Z self = 2025-05-07T20:33:41.4874710Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4874715Z 2025-05-07T20:33:41.4874792Z @given( 2025-05-07T20:33:41.4874914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4875058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4875175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4875301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4875414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4875553Z ) 2025-05-07T20:33:41.4875860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4875954Z def test_silu_mul_quant( 2025-05-07T20:33:41.4876030Z self, 2025-05-07T20:33:41.4876112Z T: int, 2025-05-07T20:33:41.4876189Z D: int, 2025-05-07T20:33:41.4876292Z scale_ub: Optional[float], 2025-05-07T20:33:41.4876381Z contiguous: bool, 2025-05-07T20:33:41.4876467Z compiled: bool, 2025-05-07T20:33:41.4876548Z ) -> None: 2025-05-07T20:33:41.4876642Z torch.manual_seed(2025) 2025-05-07T20:33:41.4876720Z 2025-05-07T20:33:41.4876889Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4876969Z 2025-05-07T20:33:41.4877062Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4877235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4877326Z x = x_sign * x_clamp 2025-05-07T20:33:41.4877410Z x0 = x[:, :D] 2025-05-07T20:33:41.4877497Z x1 = x[:, D:] 2025-05-07T20:33:41.4877569Z 2025-05-07T20:33:41.4877653Z if contiguous: 2025-05-07T20:33:41.4877748Z x0 = x0.contiguous() 2025-05-07T20:33:41.4877837Z x1 = x1.contiguous() 2025-05-07T20:33:41.4877912Z 2025-05-07T20:33:41.4878002Z if scale_ub is not None: 2025-05-07T20:33:41.4878108Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4878243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4878321Z ) 2025-05-07T20:33:41.4878397Z else: 2025-05-07T20:33:41.4878494Z scale_ub_tensor = None 2025-05-07T20:33:41.4878569Z 2025-05-07T20:33:41.4878701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4878802Z op = silu_mul_quant 2025-05-07T20:33:41.4878889Z if compiled: 2025-05-07T20:33:41.4878988Z op = torch.compile(op) 2025-05-07T20:33:41.4879099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4879173Z 2025-05-07T20:33:41.4879270Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4879274Z 2025-05-07T20:33:41.4879375Z moe/activation_test.py:117: 2025-05-07T20:33:41.4879502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4879608Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4879707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4880201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4880303Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4880661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4880932Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4881270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4881365Z kernel = self.compile( 2025-05-07T20:33:41.4881747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4881921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4882048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4882057Z 2025-05-07T20:33:41.4882259Z self = 2025-05-07T20:33:41.4883081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4883589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa7f74c0>} 2025-05-07T20:33:41.4884374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4884565Z context = 2025-05-07T20:33:41.4884570Z 2025-05-07T20:33:41.4884733Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4884994Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4885104Z module_map=module_map) 2025-05-07T20:33:41.4885267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4885409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4885490Z E ^ 2025-05-07T20:33:41.4885841Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4885848Z 2025-05-07T20:33:41.4886262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4886267Z 2025-05-07T20:33:41.4886369Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4886589Z self=, 2025-05-07T20:33:41.4886671Z T=128, 2025-05-07T20:33:41.4886747Z D=7168, 2025-05-07T20:33:41.4886831Z scale_ub=None, 2025-05-07T20:33:41.4886916Z contiguous=True, 2025-05-07T20:33:41.4887000Z compiled=False, 2025-05-07T20:33:41.4887080Z ) 2025-05-07T20:33:41.4887299Z self = 2025-05-07T20:33:41.4887471Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4887475Z 2025-05-07T20:33:41.4887558Z @given( 2025-05-07T20:33:41.4887678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4887778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4887895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4888013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4888127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4888202Z ) 2025-05-07T20:33:41.4888443Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4888538Z def test_silu_mul_quant( 2025-05-07T20:33:41.4888615Z self, 2025-05-07T20:33:41.4888692Z T: int, 2025-05-07T20:33:41.4888772Z D: int, 2025-05-07T20:33:41.4888872Z scale_ub: Optional[float], 2025-05-07T20:33:41.4888964Z contiguous: bool, 2025-05-07T20:33:41.4889102Z compiled: bool, 2025-05-07T20:33:41.4889183Z ) -> None: 2025-05-07T20:33:41.4889279Z torch.manual_seed(2025) 2025-05-07T20:33:41.4889355Z 2025-05-07T20:33:41.4889532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4889610Z 2025-05-07T20:33:41.4889702Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4889827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4889923Z x = x_sign * x_clamp 2025-05-07T20:33:41.4890006Z x0 = x[:, :D] 2025-05-07T20:33:41.4890086Z x1 = x[:, D:] 2025-05-07T20:33:41.4890161Z 2025-05-07T20:33:41.4890246Z if contiguous: 2025-05-07T20:33:41.4890336Z x0 = x0.contiguous() 2025-05-07T20:33:41.4890430Z x1 = x1.contiguous() 2025-05-07T20:33:41.4890502Z 2025-05-07T20:33:41.4890593Z if scale_ub is not None: 2025-05-07T20:33:41.4890747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4890884Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4890965Z ) 2025-05-07T20:33:41.4891042Z else: 2025-05-07T20:33:41.4891177Z scale_ub_tensor = None 2025-05-07T20:33:41.4891253Z 2025-05-07T20:33:41.4891382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4891472Z op = silu_mul_quant 2025-05-07T20:33:41.4891560Z if compiled: 2025-05-07T20:33:41.4891658Z op = torch.compile(op) 2025-05-07T20:33:41.4891762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4891838Z 2025-05-07T20:33:41.4891928Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4891933Z 2025-05-07T20:33:41.4892030Z moe/activation_test.py:117: 2025-05-07T20:33:41.4892161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4892260Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4892369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4892903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4893005Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4893362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4893581Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4893917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4894015Z kernel = self.compile( 2025-05-07T20:33:41.4894392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4894568Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4894699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4894706Z 2025-05-07T20:33:41.4894911Z self = 2025-05-07T20:33:41.4895688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4896190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa168540>} 2025-05-07T20:33:41.4896934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4897124Z context = 2025-05-07T20:33:41.4897129Z 2025-05-07T20:33:41.4897298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4897604Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4897713Z module_map=module_map) 2025-05-07T20:33:41.4897876Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4897975Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4898053Z E ^ 2025-05-07T20:33:41.4898410Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4898414Z 2025-05-07T20:33:41.4898822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4898826Z 2025-05-07T20:33:41.4898932Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4899154Z self=, 2025-05-07T20:33:41.4899273Z T=2048, 2025-05-07T20:33:41.4899357Z D=7168, 2025-05-07T20:33:41.4899444Z scale_ub=1200.0, 2025-05-07T20:33:41.4899529Z contiguous=True, 2025-05-07T20:33:41.4899658Z compiled=False, 2025-05-07T20:33:41.4899732Z ) 2025-05-07T20:33:41.4899948Z self = 2025-05-07T20:33:41.4900126Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4900130Z 2025-05-07T20:33:41.4900208Z @given( 2025-05-07T20:33:41.4900331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4900429Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4900544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4900664Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4900778Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4900853Z ) 2025-05-07T20:33:41.4901104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4901238Z def test_silu_mul_quant( 2025-05-07T20:33:41.4901322Z self, 2025-05-07T20:33:41.4901402Z T: int, 2025-05-07T20:33:41.4901483Z D: int, 2025-05-07T20:33:41.4901585Z scale_ub: Optional[float], 2025-05-07T20:33:41.4901674Z contiguous: bool, 2025-05-07T20:33:41.4901759Z compiled: bool, 2025-05-07T20:33:41.4901841Z ) -> None: 2025-05-07T20:33:41.4901935Z torch.manual_seed(2025) 2025-05-07T20:33:41.4902009Z 2025-05-07T20:33:41.4902182Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4903980Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4903990Z 2025-05-07T20:33:41.4904110Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4904114Z 2025-05-07T20:33:41.4904217Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4904440Z self=, 2025-05-07T20:33:41.4904519Z T=1, 2025-05-07T20:33:41.4904595Z D=5120, 2025-05-07T20:33:41.4904686Z scale_ub=1200.0, 2025-05-07T20:33:41.4904772Z contiguous=True, 2025-05-07T20:33:41.4904855Z compiled=False, 2025-05-07T20:33:41.4904932Z ) 2025-05-07T20:33:41.4905147Z self = 2025-05-07T20:33:41.4905310Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4905317Z 2025-05-07T20:33:41.4905441Z @given( 2025-05-07T20:33:41.4905562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4905660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4905784Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4905898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4906014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4906088Z ) 2025-05-07T20:33:41.4906328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4906423Z def test_silu_mul_quant( 2025-05-07T20:33:41.4906499Z self, 2025-05-07T20:33:41.4906575Z T: int, 2025-05-07T20:33:41.4906658Z D: int, 2025-05-07T20:33:41.4906758Z scale_ub: Optional[float], 2025-05-07T20:33:41.4906845Z contiguous: bool, 2025-05-07T20:33:41.4906933Z compiled: bool, 2025-05-07T20:33:41.4907012Z ) -> None: 2025-05-07T20:33:41.4907175Z torch.manual_seed(2025) 2025-05-07T20:33:41.4907255Z 2025-05-07T20:33:41.4907422Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4907537Z 2025-05-07T20:33:41.4907629Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4907754Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4907842Z x = x_sign * x_clamp 2025-05-07T20:33:41.4907923Z x0 = x[:, :D] 2025-05-07T20:33:41.4908002Z x1 = x[:, D:] 2025-05-07T20:33:41.4908078Z 2025-05-07T20:33:41.4908161Z if contiguous: 2025-05-07T20:33:41.4908255Z x0 = x0.contiguous() 2025-05-07T20:33:41.4908346Z x1 = x1.contiguous() 2025-05-07T20:33:41.4908419Z 2025-05-07T20:33:41.4908509Z if scale_ub is not None: 2025-05-07T20:33:41.4908617Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4908749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4908831Z ) 2025-05-07T20:33:41.4908909Z else: 2025-05-07T20:33:41.4909046Z scale_ub_tensor = None 2025-05-07T20:33:41.4909130Z 2025-05-07T20:33:41.4909261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4909355Z op = silu_mul_quant 2025-05-07T20:33:41.4909448Z if compiled: 2025-05-07T20:33:41.4909545Z op = torch.compile(op) 2025-05-07T20:33:41.4909651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4909727Z 2025-05-07T20:33:41.4909818Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4909822Z 2025-05-07T20:33:41.4909918Z moe/activation_test.py:117: 2025-05-07T20:33:41.4910049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4910148Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4910250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4910747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4910849Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4911210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4911433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4911775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4911868Z kernel = self.compile( 2025-05-07T20:33:41.4912245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4912422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4912547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4912551Z 2025-05-07T20:33:41.4912759Z self = 2025-05-07T20:33:41.4913581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4914084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa169b20>} 2025-05-07T20:33:41.4914826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4915016Z context = 2025-05-07T20:33:41.4915020Z 2025-05-07T20:33:41.4915187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4915488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4915600Z module_map=module_map) 2025-05-07T20:33:41.4915809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4915951Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4916029Z E ^ 2025-05-07T20:33:41.4916385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4916389Z 2025-05-07T20:33:41.4916796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4916801Z 2025-05-07T20:33:41.4916906Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4917128Z self=, 2025-05-07T20:33:41.4917205Z T=2048, 2025-05-07T20:33:41.4917285Z D=5120, 2025-05-07T20:33:41.4917368Z scale_ub=None, 2025-05-07T20:33:41.4917455Z contiguous=True, 2025-05-07T20:33:41.4917586Z compiled=False, 2025-05-07T20:33:41.4917660Z ) 2025-05-07T20:33:41.4917880Z self = 2025-05-07T20:33:41.4918055Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4918059Z 2025-05-07T20:33:41.4918138Z @given( 2025-05-07T20:33:41.4918258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4918357Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4918471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4918589Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4918702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4918776Z ) 2025-05-07T20:33:41.4919020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4919114Z def test_silu_mul_quant( 2025-05-07T20:33:41.4919198Z self, 2025-05-07T20:33:41.4919278Z T: int, 2025-05-07T20:33:41.4919358Z D: int, 2025-05-07T20:33:41.4919460Z scale_ub: Optional[float], 2025-05-07T20:33:41.4919555Z contiguous: bool, 2025-05-07T20:33:41.4919639Z compiled: bool, 2025-05-07T20:33:41.4919722Z ) -> None: 2025-05-07T20:33:41.4919815Z torch.manual_seed(2025) 2025-05-07T20:33:41.4919889Z 2025-05-07T20:33:41.4920057Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4920131Z 2025-05-07T20:33:41.4920223Z > x_sign = torch.sign(x) 2025-05-07T20:33:41.4922022Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
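[NOTE] The second recurring failure mode is the triton.compiler.errors.CompilationError above: Triton rejects the fp8e4nv (e4m3) element type because this GPU only exposes fp8e4b15 and fp8e5, which typically indicates a pre-SM-8.9 NVIDIA part (fp8e4nv generally requires compute capability 8.9 or newer). A hedged sketch of a capability gate; the helper name, the >= (8, 9) threshold, and the stand-in test class are assumptions, not code from activation_test.py:

    # sketch: skip fp8e4nv (e4m3) kernels on GPUs older than SM 8.9
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # assumption: Triton's fp8e4nv needs compute capability >= (8, 9)
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class TestSiluMulQuant(unittest.TestCase):  # hypothetical stand-in class name
        @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...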
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4922074Z 2025-05-07T20:33:41.4922192Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:41.4922201Z 2025-05-07T20:33:41.4922304Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4922526Z self=, 2025-05-07T20:33:41.4922612Z T=16384, 2025-05-07T20:33:41.4922688Z D=5120, 2025-05-07T20:33:41.4922770Z scale_ub=None, 2025-05-07T20:33:41.4922856Z contiguous=True, 2025-05-07T20:33:41.4922939Z compiled=False, 2025-05-07T20:33:41.4923014Z ) 2025-05-07T20:33:41.4923232Z self = 2025-05-07T20:33:41.4923405Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4923409Z 2025-05-07T20:33:41.4923529Z @given( 2025-05-07T20:33:41.4923657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4923756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4923916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4924031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4924145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4927325Z ) 2025-05-07T20:33:41.4927587Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4927681Z def test_silu_mul_quant( 2025-05-07T20:33:41.4927763Z self, 2025-05-07T20:33:41.4927840Z T: int, 2025-05-07T20:33:41.4927916Z D: int, 2025-05-07T20:33:41.4928019Z scale_ub: Optional[float], 2025-05-07T20:33:41.4928108Z contiguous: bool, 2025-05-07T20:33:41.4928193Z compiled: bool, 2025-05-07T20:33:41.4928277Z ) -> None: 2025-05-07T20:33:41.4928375Z torch.manual_seed(2025) 2025-05-07T20:33:41.4928450Z 2025-05-07T20:33:41.4928686Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4930482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4930495Z 2025-05-07T20:33:41.4930611Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4930616Z 2025-05-07T20:33:41.4930718Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4930942Z self=, 2025-05-07T20:33:41.4931022Z T=4096, 2025-05-07T20:33:41.4931103Z D=5120, 2025-05-07T20:33:41.4931191Z scale_ub=None, 2025-05-07T20:33:41.4931275Z contiguous=True, 2025-05-07T20:33:41.4931360Z compiled=False, 2025-05-07T20:33:41.4931438Z ) 2025-05-07T20:33:41.4931656Z self = 2025-05-07T20:33:41.4931825Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4931833Z 2025-05-07T20:33:41.4931910Z @given( 2025-05-07T20:33:41.4932027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4932128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4932242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4932357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4932471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4932546Z ) 2025-05-07T20:33:41.4932792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4932936Z def test_silu_mul_quant( 2025-05-07T20:33:41.4933012Z self, 2025-05-07T20:33:41.4933093Z T: int, 2025-05-07T20:33:41.4933169Z D: int, 2025-05-07T20:33:41.4933266Z scale_ub: Optional[float], 2025-05-07T20:33:41.4933362Z contiguous: bool, 2025-05-07T20:33:41.4933447Z compiled: bool, 2025-05-07T20:33:41.4933526Z ) -> None: 2025-05-07T20:33:41.4933624Z torch.manual_seed(2025) 2025-05-07T20:33:41.4933696Z 2025-05-07T20:33:41.4933863Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4935687Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4935730Z 2025-05-07T20:33:41.4935846Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4935851Z 2025-05-07T20:33:41.4935954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4936175Z self=, 2025-05-07T20:33:41.4936257Z T=2048, 2025-05-07T20:33:41.4936335Z D=5120, 2025-05-07T20:33:41.4936418Z scale_ub=None, 2025-05-07T20:33:41.4936507Z contiguous=False, 2025-05-07T20:33:41.4936597Z compiled=False, 2025-05-07T20:33:41.4936669Z ) 2025-05-07T20:33:41.4936885Z self = 2025-05-07T20:33:41.4937060Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:41.4937066Z 2025-05-07T20:33:41.4937186Z @given( 2025-05-07T20:33:41.4937306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4937407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4937521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4937638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4937751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4937831Z ) 2025-05-07T20:33:41.4938072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4938167Z def test_silu_mul_quant( 2025-05-07T20:33:41.4938246Z self, 2025-05-07T20:33:41.4938322Z T: int, 2025-05-07T20:33:41.4938399Z D: int, 2025-05-07T20:33:41.4938500Z scale_ub: Optional[float], 2025-05-07T20:33:41.4938587Z contiguous: bool, 2025-05-07T20:33:41.4938671Z compiled: bool, 2025-05-07T20:33:41.4938757Z ) -> None: 2025-05-07T20:33:41.4938855Z torch.manual_seed(2025) 2025-05-07T20:33:41.4938932Z 2025-05-07T20:33:41.4939104Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4940890Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4940898Z 2025-05-07T20:33:41.4941014Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4941018Z 2025-05-07T20:33:41.4941121Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4941348Z self=, 2025-05-07T20:33:41.4941496Z T=4096, 2025-05-07T20:33:41.4941573Z D=7168, 2025-05-07T20:33:41.4941662Z scale_ub=None, 2025-05-07T20:33:41.4941746Z contiguous=True, 2025-05-07T20:33:41.4941829Z compiled=True, 2025-05-07T20:33:41.4941907Z ) 2025-05-07T20:33:41.4942122Z self = 2025-05-07T20:33:41.4942291Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.4942299Z 2025-05-07T20:33:41.4942376Z @given( 2025-05-07T20:33:41.4942494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4942595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4942711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4942826Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4942986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4943067Z ) 2025-05-07T20:33:41.4943311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4943407Z def test_silu_mul_quant( 2025-05-07T20:33:41.4943531Z self, 2025-05-07T20:33:41.4943611Z T: int, 2025-05-07T20:33:41.4943687Z D: int, 2025-05-07T20:33:41.4943784Z scale_ub: Optional[float], 2025-05-07T20:33:41.4943876Z contiguous: bool, 2025-05-07T20:33:41.4943961Z compiled: bool, 2025-05-07T20:33:41.4944038Z ) -> None: 2025-05-07T20:33:41.4944135Z torch.manual_seed(2025) 2025-05-07T20:33:41.4944209Z 2025-05-07T20:33:41.4944374Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4946206Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4946216Z 2025-05-07T20:33:41.4946332Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4946336Z 2025-05-07T20:33:41.4946443Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4946662Z self=, 2025-05-07T20:33:41.4946742Z T=2048, 2025-05-07T20:33:41.4946819Z D=5120, 2025-05-07T20:33:41.4946904Z scale_ub=1200.0, 2025-05-07T20:33:41.4946992Z contiguous=False, 2025-05-07T20:33:41.4947074Z compiled=False, 2025-05-07T20:33:41.4947146Z ) 2025-05-07T20:33:41.4947364Z self = 2025-05-07T20:33:41.4947540Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4947546Z 2025-05-07T20:33:41.4947622Z @given( 2025-05-07T20:33:41.4947748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4947846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4947958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4948076Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4948188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4948270Z ) 2025-05-07T20:33:41.4948509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4948603Z def test_silu_mul_quant( 2025-05-07T20:33:41.4948682Z self, 2025-05-07T20:33:41.4948757Z T: int, 2025-05-07T20:33:41.4948833Z D: int, 2025-05-07T20:33:41.4948934Z scale_ub: Optional[float], 2025-05-07T20:33:41.4949025Z contiguous: bool, 2025-05-07T20:33:41.4949108Z compiled: bool, 2025-05-07T20:33:41.4949309Z ) -> None: 2025-05-07T20:33:41.4949407Z torch.manual_seed(2025) 2025-05-07T20:33:41.4949480Z 2025-05-07T20:33:41.4949649Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4951427Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4951435Z 2025-05-07T20:33:41.4951549Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4951593Z 2025-05-07T20:33:41.4951697Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4951922Z self=, 2025-05-07T20:33:41.4952037Z T=4096, 2025-05-07T20:33:41.4952112Z D=7168, 2025-05-07T20:33:41.4952200Z scale_ub=1200.0, 2025-05-07T20:33:41.4952283Z contiguous=True, 2025-05-07T20:33:41.4952364Z compiled=False, 2025-05-07T20:33:41.4952444Z ) 2025-05-07T20:33:41.4952656Z self = 2025-05-07T20:33:41.4952826Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4952835Z 2025-05-07T20:33:41.4952912Z @given( 2025-05-07T20:33:41.4953026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4953125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4953236Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4953353Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4953509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4953583Z ) 2025-05-07T20:33:41.4953824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4953921Z def test_silu_mul_quant( 2025-05-07T20:33:41.4953999Z self, 2025-05-07T20:33:41.4954078Z T: int, 2025-05-07T20:33:41.4954153Z D: int, 2025-05-07T20:33:41.4954248Z scale_ub: Optional[float], 2025-05-07T20:33:41.4954339Z contiguous: bool, 2025-05-07T20:33:41.4954423Z compiled: bool, 2025-05-07T20:33:41.4954500Z ) -> None: 2025-05-07T20:33:41.4954599Z torch.manual_seed(2025) 2025-05-07T20:33:41.4954670Z 2025-05-07T20:33:41.4954836Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4956689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4956700Z 2025-05-07T20:33:41.4956817Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4956822Z 2025-05-07T20:33:41.4956925Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4957143Z self=, 2025-05-07T20:33:41.4957223Z T=16384, 2025-05-07T20:33:41.4957297Z D=7168, 2025-05-07T20:33:41.4957378Z scale_ub=None, 2025-05-07T20:33:41.4957464Z contiguous=False, 2025-05-07T20:33:41.4957545Z compiled=True, 2025-05-07T20:33:41.4957617Z ) 2025-05-07T20:33:41.4957833Z self = 2025-05-07T20:33:41.4958056Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:41.4958063Z 2025-05-07T20:33:41.4958144Z @given( 2025-05-07T20:33:41.4958267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4958363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4958475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4958593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4958707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4958781Z ) 2025-05-07T20:33:41.4959020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4959112Z def test_silu_mul_quant( 2025-05-07T20:33:41.4959193Z self, 2025-05-07T20:33:41.4959268Z T: int, 2025-05-07T20:33:41.4959344Z D: int, 2025-05-07T20:33:41.4959486Z scale_ub: Optional[float], 2025-05-07T20:33:41.4959578Z contiguous: bool, 2025-05-07T20:33:41.4959666Z compiled: bool, 2025-05-07T20:33:41.4959748Z ) -> None: 2025-05-07T20:33:41.4959882Z torch.manual_seed(2025) 2025-05-07T20:33:41.4959953Z 2025-05-07T20:33:41.4960123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4961904Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4961913Z 2025-05-07T20:33:41.4962030Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4962075Z 2025-05-07T20:33:41.4962176Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4962398Z self=, 2025-05-07T20:33:41.4962479Z T=4096, 2025-05-07T20:33:41.4962555Z D=7168, 2025-05-07T20:33:41.4962642Z scale_ub=None, 2025-05-07T20:33:41.4962724Z contiguous=True, 2025-05-07T20:33:41.4962806Z compiled=False, 2025-05-07T20:33:41.4962881Z ) 2025-05-07T20:33:41.4963094Z self = 2025-05-07T20:33:41.4963263Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4963267Z 2025-05-07T20:33:41.4963344Z @given( 2025-05-07T20:33:41.4963460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4963564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4963680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4963799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4963914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4963988Z ) 2025-05-07T20:33:41.4964227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4964322Z def test_silu_mul_quant( 2025-05-07T20:33:41.4964397Z self, 2025-05-07T20:33:41.4964477Z T: int, 2025-05-07T20:33:41.4964552Z D: int, 2025-05-07T20:33:41.4964647Z scale_ub: Optional[float], 2025-05-07T20:33:41.4964737Z contiguous: bool, 2025-05-07T20:33:41.4964822Z compiled: bool, 2025-05-07T20:33:41.4964898Z ) -> None: 2025-05-07T20:33:41.4964996Z torch.manual_seed(2025) 2025-05-07T20:33:41.4965068Z 2025-05-07T20:33:41.4965232Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4967390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4967491Z 2025-05-07T20:33:41.4967611Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4967616Z 2025-05-07T20:33:41.4967718Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4967938Z self=, 2025-05-07T20:33:41.4968018Z T=16384, 2025-05-07T20:33:41.4968094Z D=7168, 2025-05-07T20:33:41.4968173Z scale_ub=None, 2025-05-07T20:33:41.4968262Z contiguous=True, 2025-05-07T20:33:41.4968406Z compiled=False, 2025-05-07T20:33:41.4968482Z ) 2025-05-07T20:33:41.4968701Z self = 2025-05-07T20:33:41.4968873Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:41.4968937Z 2025-05-07T20:33:41.4969014Z @given( 2025-05-07T20:33:41.4969132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4969228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4969340Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4969459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4969569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4969645Z ) 2025-05-07T20:33:41.4969884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4969976Z def test_silu_mul_quant( 2025-05-07T20:33:41.4970055Z self, 2025-05-07T20:33:41.4970135Z T: int, 2025-05-07T20:33:41.4970213Z D: int, 2025-05-07T20:33:41.4970396Z scale_ub: Optional[float], 2025-05-07T20:33:41.4970488Z contiguous: bool, 2025-05-07T20:33:41.4970572Z compiled: bool, 2025-05-07T20:33:41.4970655Z ) -> None: 2025-05-07T20:33:41.4970749Z torch.manual_seed(2025) 2025-05-07T20:33:41.4970820Z 2025-05-07T20:33:41.4970988Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4972771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
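[NOTE] The allocation sizes reported by the allocator line up exactly with the test's input shape: x is T rows by 2*D columns in bfloat16 (2 bytes per element), so T=16384, D=7168 gives the 448.00 MiB request above. A quick arithmetic check:

    # sketch: confirm the 448 MiB allocation reported for T=16384, D=7168
    T, D = 16384, 7168
    n_bytes = T * (2 * D) * 2      # bf16 is 2 bytes per element
    print(n_bytes / 1024 ** 2)     # -> 448.0 (MiB), matching the log

The 40, 56, 80, 112, and 320 MiB requests elsewhere in this output correspond to the other T/D combinations by the same formula.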
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4972781Z 2025-05-07T20:33:41.4972898Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4972903Z 2025-05-07T20:33:41.4973004Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4973224Z self=, 2025-05-07T20:33:41.4973299Z T=16384, 2025-05-07T20:33:41.4973375Z D=7168, 2025-05-07T20:33:41.4973460Z scale_ub=1200.0, 2025-05-07T20:33:41.4973544Z contiguous=True, 2025-05-07T20:33:41.4973627Z compiled=False, 2025-05-07T20:33:41.4973707Z ) 2025-05-07T20:33:41.4973923Z self = 2025-05-07T20:33:41.4974100Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.4974104Z 2025-05-07T20:33:41.4974182Z @given( 2025-05-07T20:33:41.4974299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4974404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4974565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4974684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4974800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4974874Z ) 2025-05-07T20:33:41.4975112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4975209Z def test_silu_mul_quant( 2025-05-07T20:33:41.4975284Z self, 2025-05-07T20:33:41.4975361Z T: int, 2025-05-07T20:33:41.4975435Z D: int, 2025-05-07T20:33:41.4975532Z scale_ub: Optional[float], 2025-05-07T20:33:41.4975624Z contiguous: bool, 2025-05-07T20:33:41.4975709Z compiled: bool, 2025-05-07T20:33:41.4975786Z ) -> None: 2025-05-07T20:33:41.4975884Z torch.manual_seed(2025) 2025-05-07T20:33:41.4975957Z 2025-05-07T20:33:41.4976121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4977953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4978000Z 2025-05-07T20:33:41.4978115Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4978119Z 2025-05-07T20:33:41.4978224Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4978442Z self=, 2025-05-07T20:33:41.4978527Z T=128, 2025-05-07T20:33:41.4978604Z D=5120, 2025-05-07T20:33:41.4978692Z scale_ub=1200.0, 2025-05-07T20:33:41.4978782Z contiguous=False, 2025-05-07T20:33:41.4978904Z compiled=False, 2025-05-07T20:33:41.4978980Z ) 2025-05-07T20:33:41.4979198Z self = 2025-05-07T20:33:41.4979371Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:41.4979376Z 2025-05-07T20:33:41.4979451Z @given( 2025-05-07T20:33:41.4979571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4979667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4979781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4979897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4980009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4980085Z ) 2025-05-07T20:33:41.4980324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4980419Z def test_silu_mul_quant( 2025-05-07T20:33:41.4980498Z self, 2025-05-07T20:33:41.4980577Z T: int, 2025-05-07T20:33:41.4980656Z D: int, 2025-05-07T20:33:41.4980757Z scale_ub: Optional[float], 2025-05-07T20:33:41.4980848Z contiguous: bool, 2025-05-07T20:33:41.4980933Z compiled: bool, 2025-05-07T20:33:41.4981013Z ) -> None: 2025-05-07T20:33:41.4981106Z torch.manual_seed(2025) 2025-05-07T20:33:41.4981177Z 2025-05-07T20:33:41.4981345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4981418Z 2025-05-07T20:33:41.4981516Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4981640Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4981730Z x = x_sign * x_clamp 2025-05-07T20:33:41.4981815Z x0 = x[:, :D] 2025-05-07T20:33:41.4981896Z x1 = x[:, D:] 2025-05-07T20:33:41.4981967Z 2025-05-07T20:33:41.4982054Z if contiguous: 2025-05-07T20:33:41.4982148Z x0 = x0.contiguous() 2025-05-07T20:33:41.4982238Z x1 = x1.contiguous() 2025-05-07T20:33:41.4982364Z 2025-05-07T20:33:41.4982455Z if scale_ub is not None: 2025-05-07T20:33:41.4982559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.4982700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.4982776Z ) 2025-05-07T20:33:41.4982857Z else: 2025-05-07T20:33:41.4982951Z scale_ub_tensor = None 2025-05-07T20:33:41.4983024Z 2025-05-07T20:33:41.4983155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.4983248Z op = silu_mul_quant 2025-05-07T20:33:41.4983331Z if compiled: 2025-05-07T20:33:41.4983433Z op = torch.compile(op) 2025-05-07T20:33:41.4983537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4983608Z 2025-05-07T20:33:41.4983703Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.4983708Z 2025-05-07T20:33:41.4983847Z moe/activation_test.py:117: 2025-05-07T20:33:41.4983984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4984084Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.4984220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.4984723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.4984823Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.4985182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.4985407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.4985745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.4985844Z kernel = self.compile( 2025-05-07T20:33:41.4986226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.4986440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.4986572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.4986579Z 2025-05-07T20:33:41.4986784Z self = 2025-05-07T20:33:41.4987564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.4988068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b4860>} 2025-05-07T20:33:41.4988818Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.4989017Z context = 2025-05-07T20:33:41.4989024Z 2025-05-07T20:33:41.4989186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.4989451Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.4989560Z module_map=module_map) 2025-05-07T20:33:41.4989720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.4989821Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.4989897Z E ^ 2025-05-07T20:33:41.4990258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.4990263Z 2025-05-07T20:33:41.4990676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.4990723Z 2025-05-07T20:33:41.4990827Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4991055Z self=, 2025-05-07T20:33:41.4991133Z T=2048, 2025-05-07T20:33:41.4991209Z D=7168, 2025-05-07T20:33:41.4991293Z scale_ub=None, 2025-05-07T20:33:41.4991380Z contiguous=False, 2025-05-07T20:33:41.4991466Z compiled=False, 2025-05-07T20:33:41.4991538Z ) 2025-05-07T20:33:41.4991753Z self = 2025-05-07T20:33:41.4991931Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:41.4991935Z 2025-05-07T20:33:41.4992012Z @given( 2025-05-07T20:33:41.4992129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4992231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4992345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4992503Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4992624Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4992698Z ) 2025-05-07T20:33:41.4992984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4993076Z def test_silu_mul_quant( 2025-05-07T20:33:41.4993153Z self, 2025-05-07T20:33:41.4993233Z T: int, 2025-05-07T20:33:41.4993308Z D: int, 2025-05-07T20:33:41.4993406Z scale_ub: Optional[float], 2025-05-07T20:33:41.4993497Z contiguous: bool, 2025-05-07T20:33:41.4993580Z compiled: bool, 2025-05-07T20:33:41.4993657Z ) -> None: 2025-05-07T20:33:41.4993753Z torch.manual_seed(2025) 2025-05-07T20:33:41.4993825Z 2025-05-07T20:33:41.4993994Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4995887Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
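[NOTE] The fp8e4nv CompilationError does not depend on Hypothesis, the example parameters, or torch.compile: every path ends at the same _fbgemm_silu_mul_quant kernel launch inside silu_mul_quant. A minimal standalone repro sketch, assuming the import path shown in the traceback and the (x0, x1, scale_ub) calling convention seen in the test body:

    # sketch: reproduce the Triton CompilationError without pytest/Hypothesis
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn(16, 2 * D, device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # expected to raise CompilationError ("type fp8e4nv not supported ...") on pre-SM-8.9 GPUs
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)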
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.4995899Z 2025-05-07T20:33:41.4996020Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.4996024Z 2025-05-07T20:33:41.4996128Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.4996350Z self=, 2025-05-07T20:33:41.4996435Z T=128, 2025-05-07T20:33:41.4996512Z D=7168, 2025-05-07T20:33:41.4996599Z scale_ub=1200.0, 2025-05-07T20:33:41.4996687Z contiguous=True, 2025-05-07T20:33:41.4996771Z compiled=True, 2025-05-07T20:33:41.4996846Z ) 2025-05-07T20:33:41.4997069Z self = 2025-05-07T20:33:41.4997237Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.4997243Z 2025-05-07T20:33:41.4997327Z @given( 2025-05-07T20:33:41.4997444Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.4997546Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.4997665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.4997782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.4997894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.4997972Z ) 2025-05-07T20:33:41.4998214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.4998308Z def test_silu_mul_quant( 2025-05-07T20:33:41.4998392Z self, 2025-05-07T20:33:41.4998470Z T: int, 2025-05-07T20:33:41.4998553Z D: int, 2025-05-07T20:33:41.4998657Z scale_ub: Optional[float], 2025-05-07T20:33:41.4998796Z contiguous: bool, 2025-05-07T20:33:41.4998886Z compiled: bool, 2025-05-07T20:33:41.4998970Z ) -> None: 2025-05-07T20:33:41.4999067Z torch.manual_seed(2025) 2025-05-07T20:33:41.4999144Z 2025-05-07T20:33:41.4999309Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.4999385Z 2025-05-07T20:33:41.4999479Z x_sign = torch.sign(x) 2025-05-07T20:33:41.4999604Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.4999693Z x = x_sign * x_clamp 2025-05-07T20:33:41.4999778Z x0 = x[:, :D] 2025-05-07T20:33:41.4999859Z x1 = x[:, D:] 2025-05-07T20:33:41.4999932Z 2025-05-07T20:33:41.5000024Z if contiguous: 2025-05-07T20:33:41.5000115Z x0 = x0.contiguous() 2025-05-07T20:33:41.5000207Z x1 = x1.contiguous() 2025-05-07T20:33:41.5000281Z 2025-05-07T20:33:41.5000419Z if scale_ub is not None: 2025-05-07T20:33:41.5000532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.5000666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.5000810Z ) 2025-05-07T20:33:41.5000891Z else: 2025-05-07T20:33:41.5000985Z scale_ub_tensor = None 2025-05-07T20:33:41.5001058Z 2025-05-07T20:33:41.5001192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.5001283Z op = silu_mul_quant 2025-05-07T20:33:41.5001370Z if compiled: 2025-05-07T20:33:41.5001470Z op = torch.compile(op) 2025-05-07T20:33:41.5001574Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.5001651Z 2025-05-07T20:33:41.5001742Z > y_fp8, y_scale = fn() 2025-05-07T20:33:41.5001747Z 2025-05-07T20:33:41.5001844Z moe/activation_test.py:117: 2025-05-07T20:33:41.5001975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.5002077Z moe/activation_test.py:115: in fn 2025-05-07T20:33:41.5002219Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.5002590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:41.5002687Z return fn(*args, **kwargs) 
2025-05-07T20:33:41.5003175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:41.5003280Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:41.5003634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.5003858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.5004193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.5004287Z kernel = self.compile( 2025-05-07T20:33:41.5004674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.5004849Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.5004985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.5004990Z 2025-05-07T20:33:41.5005194Z self = 2025-05-07T20:33:41.5005967Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.5006472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f2bfa2b59e0>} 2025-05-07T20:33:41.5007220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.5007465Z context = 2025-05-07T20:33:41.5007473Z 2025-05-07T20:33:41.5007637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.5007900Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.5008010Z module_map=module_map) 2025-05-07T20:33:41.5008172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.5008275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:41.5008353Z E ^ 2025-05-07T20:33:41.5008707Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.5008712Z 2025-05-07T20:33:41.5009169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.5009176Z 2025-05-07T20:33:41.5009279Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5009543Z self=, 2025-05-07T20:33:41.5009624Z T=128, 2025-05-07T20:33:41.5009702Z D=7168, 2025-05-07T20:33:41.5009792Z scale_ub=1200.0, 2025-05-07T20:33:41.5009877Z contiguous=True, 2025-05-07T20:33:41.5009961Z compiled=False, 2025-05-07T20:33:41.5010039Z ) 2025-05-07T20:33:41.5010253Z self = 2025-05-07T20:33:41.5010424Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:41.5010428Z 2025-05-07T20:33:41.5010510Z @given( 2025-05-07T20:33:41.5010628Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.5010729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.5010852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.5011012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.5011130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.5011207Z ) 2025-05-07T20:33:41.5011447Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.5011544Z def test_silu_mul_quant( 2025-05-07T20:33:41.5011622Z self, 2025-05-07T20:33:41.5011698Z T: int, 2025-05-07T20:33:41.5011775Z D: int, 2025-05-07T20:33:41.5011872Z scale_ub: Optional[float], 2025-05-07T20:33:41.5011959Z contiguous: bool, 2025-05-07T20:33:41.5012048Z compiled: bool, 2025-05-07T20:33:41.5012125Z ) -> None: 2025-05-07T20:33:41.5012220Z torch.manual_seed(2025) 2025-05-07T20:33:41.5012292Z 2025-05-07T20:33:41.5012459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.5012534Z 2025-05-07T20:33:41.5012628Z x_sign = torch.sign(x) 2025-05-07T20:33:41.5012755Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.5014542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.5014550Z 2025-05-07T20:33:41.5014667Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:41.5014672Z 2025-05-07T20:33:41.5014777Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5015000Z self=, 2025-05-07T20:33:41.5015079Z T=128, 2025-05-07T20:33:41.5015205Z D=5120, 2025-05-07T20:33:41.5015289Z scale_ub=1200.0, 2025-05-07T20:33:41.5015379Z contiguous=True, 2025-05-07T20:33:41.5015465Z compiled=True, 2025-05-07T20:33:41.5015539Z ) 2025-05-07T20:33:41.5015757Z self = 2025-05-07T20:33:41.5015923Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:41.5015928Z 2025-05-07T20:33:41.5016006Z @given( 2025-05-07T20:33:41.5016128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.5016228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.5016343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.5016463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.5016577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.5016659Z ) 2025-05-07T20:33:41.5016942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.5017045Z def test_silu_mul_quant( 2025-05-07T20:33:41.5017126Z self, 2025-05-07T20:33:41.5017203Z T: int, 2025-05-07T20:33:41.5017324Z D: int, 2025-05-07T20:33:41.5017425Z scale_ub: Optional[float], 2025-05-07T20:33:41.5017514Z contiguous: bool, 2025-05-07T20:33:41.5017599Z compiled: bool, 2025-05-07T20:33:41.5017681Z ) -> None: 2025-05-07T20:33:41.5017777Z torch.manual_seed(2025) 2025-05-07T20:33:41.5017850Z 2025-05-07T20:33:41.5018020Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.5018093Z 2025-05-07T20:33:41.5018188Z x_sign = torch.sign(x) 2025-05-07T20:33:41.5018311Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:41.5020127Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
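[NOTE] The free-memory figures shrink monotonically across examples (26.44 MiB free in the earlier failures, 4.44 MiB free here) while PyTorch's allocated total creeps from 21.69 GiB up to 21.77 GiB, so memory from earlier Hypothesis examples is evidently still held when later ones run. With @given on a unittest.TestCase method, setUp/tearDown run once per test rather than once per example, so any cleanup has to live in the test body itself. A sketch of such per-example cleanup; the helper and its placement are assumptions, not code from activation_test.py:

    # sketch: release cached CUDA memory at the top of each Hypothesis example
    import gc
    import torch

    def _free_cuda() -> None:
        gc.collect()               # drop dead Python references first
        torch.cuda.empty_cache()   # then return cached blocks to the driver

    # inside test_silu_mul_quant, before the first allocation:
    #     _free_cuda()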
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.5020142Z 2025-05-07T20:33:41.5020260Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:41.5020265Z 2025-05-07T20:33:41.5020366Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.5020591Z self=, 2025-05-07T20:33:41.5020669Z T=128, 2025-05-07T20:33:41.5020748Z D=7168, 2025-05-07T20:33:41.5020832Z scale_ub=None, 2025-05-07T20:33:41.5020918Z contiguous=True, 2025-05-07T20:33:41.5021001Z compiled=True, 2025-05-07T20:33:41.5021077Z ) 2025-05-07T20:33:41.5021295Z self = 2025-05-07T20:33:41.5021469Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.5021474Z 2025-05-07T20:33:41.5021554Z @given( 2025-05-07T20:33:41.5021673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.5021773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.5021886Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.5022001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.5022116Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.5022190Z ) 2025-05-07T20:33:41.5022431Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.5022527Z def test_silu_mul_quant( 2025-05-07T20:33:41.5022604Z self, 2025-05-07T20:33:41.5022683Z T: int, 2025-05-07T20:33:41.5022762Z D: int, 2025-05-07T20:33:41.5022861Z scale_ub: Optional[float], 2025-05-07T20:33:41.5022952Z contiguous: bool, 2025-05-07T20:33:41.5023087Z compiled: bool, 2025-05-07T20:33:41.5023165Z ) -> None: 2025-05-07T20:33:41.5023262Z torch.manual_seed(2025) 2025-05-07T20:33:41.5023339Z 2025-05-07T20:33:41.5023504Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.5025283Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:41.5025289Z 2025-05-07T20:33:41.5025445Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:41.5025587Z =============================== warnings summary =============================== 2025-05-07T20:33:41.5025891Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:41.5026234Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:41.5026529Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:41.5027402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:41.5027636Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:41.5027640Z 2025-05-07T20:33:41.5027853Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:41.5028063Z ================= 1 failed, 1 deselected, 3 warnings in 13.15s ================= 2025-05-07T20:33:43.0829866Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:43.1444164Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:43.1444423Z 2025-05-07T20:33:43.1444603Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:43.1445177Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:43.1445590Z 2025-05-07T20:33:43.1445594Z 2025-05-07T20:33:43.1445598Z 2025-05-07T20:33:43.1462426Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:43.1552191Z Post job cleanup. 2025-05-07T20:33:43.2538980Z [command]/usr/bin/git version 2025-05-07T20:33:43.2581385Z git version 2.47.1 2025-05-07T20:33:43.2616152Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/db22bca0-6ceb-4a34-9559-aee67b9a86bd/.gitconfig' 2025-05-07T20:33:43.2626608Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/db22bca0-6ceb-4a34-9559-aee67b9a86bd' before making global git config changes 2025-05-07T20:33:43.2627464Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:43.2640915Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:43.2682647Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:43.2717213Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:43.3054766Z Entering 'external/asmjit' 2025-05-07T20:33:43.3121468Z Entering 'external/composable_kernel' 2025-05-07T20:33:43.3195030Z Entering 'external/cpuinfo' 2025-05-07T20:33:43.3259952Z Entering 'external/cutlass' 2025-05-07T20:33:43.3336969Z Entering 'external/googletest' 2025-05-07T20:33:43.3404196Z Entering 'external/hipify_torch' 2025-05-07T20:33:43.3469523Z Entering 'external/json' 2025-05-07T20:33:43.3559248Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:43.3582467Z http.https://github.com/.extraheader 2025-05-07T20:33:43.3593728Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:43.3624637Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:43.3953627Z Entering 'external/asmjit' 2025-05-07T20:33:43.3996634Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4039330Z Entering 'external/composable_kernel' 2025-05-07T20:33:43.4084203Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4132537Z Entering 'external/cpuinfo' 2025-05-07T20:33:43.4175374Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4218404Z Entering 'external/cutlass' 2025-05-07T20:33:43.4261693Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4314964Z 
Entering 'external/googletest' 2025-05-07T20:33:43.4357879Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4401462Z Entering 'external/hipify_torch' 2025-05-07T20:33:43.4444699Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4488487Z Entering 'external/json' 2025-05-07T20:33:43.4531917Z http.https://github.com/.extraheader 2025-05-07T20:33:43.4685862Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:43.4719486Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:43.4729863Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:43.4730235Z ##[endgroup] 2025-05-07T20:33:43.4827292Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:54.2269793Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:10.6241936Z Cleaning up orphan processes